[tip:irq/core] genirq/affinity: Add is_managed to struct irq_affinity_desc
Commit-ID:  c410abbbacb9b378365ba17a30df08b4b9eec64f
Gitweb:     https://git.kernel.org/tip/c410abbbacb9b378365ba17a30df08b4b9eec64f
Author:     Dou Liyang
AuthorDate: Tue, 4 Dec 2018 23:51:21 +0800
Committer:  Thomas Gleixner
CommitDate: Wed, 19 Dec 2018 11:32:08 +0100

genirq/affinity: Add is_managed to struct irq_affinity_desc

Devices which use managed interrupts usually have two classes of
interrupts:

  - Interrupts for multiple device queues
  - Interrupts for general device management

Currently both classes are treated the same way, i.e. as managed
interrupts. The general interrupts get the default affinity mask assigned
while the device queue interrupts are spread out over the possible CPUs.

Treating the general interrupts as managed is both a limitation and under
certain circumstances a bug. Assume the following situation:

  default_irq_affinity = 4..7

So if CPUs 4-7 are offlined, then the core code will shut down the device
management interrupts because the last CPU in their affinity mask went
offline.

It's also a limitation because it's desired to allow manual placement of
the general device interrupts for various reasons. If they are marked
managed then the interrupt affinity setting from both user and kernel
space is disabled. That limitation was reported by Kashyap and Sumit.

Expand struct irq_affinity_desc with a new bit 'is_managed' which is set
for truly managed interrupts (queue interrupts) and cleared for the
general device interrupts.
[ tglx: Simplify code and massage changelog ]

Reported-by: Kashyap Desai
Reported-by: Sumit Saxena
Signed-off-by: Dou Liyang
Signed-off-by: Thomas Gleixner
Cc: linux-...@vger.kernel.org
Cc: shivasharan.srikanteshw...@broadcom.com
Cc: ming@redhat.com
Cc: h...@lst.de
Cc: bhelg...@google.com
Cc: douliya...@huawei.com
Link: https://lkml.kernel.org/r/20181204155122.6327-3-douliya...@gmail.com
---
 include/linux/interrupt.h |  1
 kernel/irq/affinity.c     |  4
 kernel/irq/irqdesc.c      | 13
 3 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index c44b7844dc83..c672f34235e7 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -263,6 +263,7 @@ struct irq_affinity {
  */
 struct irq_affinity_desc {
     struct cpumask  mask;
+    unsigned int    is_managed : 1;
 };

 #if defined(CONFIG_SMP)
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index c0fe591b0dc9..45b68b4ea48b 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -289,6 +289,10 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
     for (; curvec < nvecs; curvec++)
         cpumask_copy(&masks[curvec].mask, irq_default_affinity);

+    /* Mark the managed interrupts */
+    for (i = affd->pre_vectors; i < nvecs - affd->post_vectors; i++)
+        masks[i].is_managed = 1;
+
 outnodemsk:
     free_node_to_cpumask(node_to_cpumask);
     return masks;
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index cb401d6c5040..ee062b7939d3 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -453,27 +453,30 @@ static int alloc_descs(unsigned int start, unsigned int cnt, int node,
                struct module *owner)
 {
     struct irq_desc *desc;
-    unsigned int flags;
     int i;

     /* Validate affinity mask(s) */
     if (affinity) {
         for (i = 0; i < cnt; i++) {
             if (cpumask_empty(&affinity[i].mask))
                 return -EINVAL;
         }
     }

-    flags = affinity ? IRQD_AFFINITY_MANAGED | IRQD_MANAGED_SHUTDOWN : 0;
-
     for (i = 0; i < cnt; i++) {
         const struct cpumask *mask = NULL;
+        unsigned int flags = 0;

         if (affinity) {
-            node = cpu_to_node(cpumask_first(affinity));
+            if (affinity->is_managed) {
+                flags = IRQD_AFFINITY_MANAGED |
+                    IRQD_MANAGED_SHUTDOWN;
+            }
             mask = &affinity->mask;
+            node = cpu_to_node(cpumask_first(mask));
             affinity++;
         }
+
         desc = alloc_desc(start + i, node, flags, mask, owner);
         if (!desc)
             goto err;
[tip:irq/core] genirq/core: Introduce struct irq_affinity_desc
Commit-ID:  bec04037e4e484f41ee4d9409e40616874169d20
Gitweb:     https://git.kernel.org/tip/bec04037e4e484f41ee4d9409e40616874169d20
Author:     Dou Liyang
AuthorDate: Tue, 4 Dec 2018 23:51:20 +0800
Committer:  Thomas Gleixner
CommitDate: Wed, 19 Dec 2018 11:32:08 +0100

genirq/core: Introduce struct irq_affinity_desc

The interrupt affinity management uses straight cpumask pointers to convey
the automatically assigned affinity masks for managed interrupts. The core
interrupt descriptor allocation also decides based on the pointer being
non NULL whether an interrupt is managed or not.

Devices which use managed interrupts usually have two classes of
interrupts:

  - Interrupts for multiple device queues
  - Interrupts for general device management

Currently both classes are treated the same way, i.e. as managed
interrupts. The general interrupts get the default affinity mask assigned
while the device queue interrupts are spread out over the possible CPUs.

Treating the general interrupts as managed is both a limitation and under
certain circumstances a bug. Assume the following situation:

  default_irq_affinity = 4..7

So if CPUs 4-7 are offlined, then the core code will shut down the device
management interrupts because the last CPU in their affinity mask went
offline.

It's also a limitation because it's desired to allow manual placement of
the general device interrupts for various reasons. If they are marked
managed then the interrupt affinity setting from both user and kernel
space is disabled.

To remedy that situation it's required to convey more information than the
cpumasks through various interfaces related to interrupt descriptor
allocation.

Instead of adding yet another argument, create a new data structure
'irq_affinity_desc' which for now just contains the cpumask. This struct
can be expanded to convey auxiliary information in the next step.

No functional change, just preparatory work.
[ tglx: Simplified logic and clarified changelog ]

Suggested-by: Thomas Gleixner
Suggested-by: Bjorn Helgaas
Signed-off-by: Dou Liyang
Signed-off-by: Thomas Gleixner
Cc: linux-...@vger.kernel.org
Cc: kashyap.de...@broadcom.com
Cc: shivasharan.srikanteshw...@broadcom.com
Cc: sumit.sax...@broadcom.com
Cc: ming@redhat.com
Cc: h...@lst.de
Cc: douliya...@huawei.com
Link: https://lkml.kernel.org/r/20181204155122.6327-2-douliya...@gmail.com
---
 drivers/pci/msi.c         |  9
 include/linux/interrupt.h | 14
 include/linux/irq.h       |  6
 include/linux/irqdomain.h |  6
 include/linux/msi.h       |  4
 kernel/irq/affinity.c     | 22
 kernel/irq/devres.c       |  4
 kernel/irq/irqdesc.c      | 17
 kernel/irq/irqdomain.c    |  4
 kernel/irq/msi.c          |  8
 10 files changed, 55 insertions(+), 39 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 265ed3e4c920..7a1c8a09efa5 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -534,14 +534,13 @@ error_attrs:
 static struct msi_desc *
 msi_setup_entry(struct pci_dev *dev, int nvec, const struct irq_affinity *affd)
 {
-    struct cpumask *masks = NULL;
+    struct irq_affinity_desc *masks = NULL;
     struct msi_desc *entry;
     u16 control;

     if (affd)
         masks = irq_create_affinity_masks(nvec, affd);
-
     /* MSI Entry Initialization */
     entry = alloc_msi_entry(&dev->dev, nvec, masks);
     if (!entry)
@@ -672,7 +671,7 @@ static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
                   struct msix_entry *entries, int nvec,
                   const struct irq_affinity *affd)
 {
-    struct cpumask *curmsk, *masks = NULL;
+    struct irq_affinity_desc *curmsk, *masks = NULL;
     struct msi_desc *entry;
     int ret, i;
@@ -1264,7 +1263,7 @@ const struct cpumask *pci_irq_get_affinity(struct pci_dev *dev, int nr)
         for_each_pci_msi_entry(entry, dev) {
             if (i == nr)
-                return entry->affinity;
+                return &entry->affinity->mask;
             i++;
         }
         WARN_ON_ONCE(1);
@@ -1276,7 +1275,7 @@ const struct cpumask *pci_irq_get_affinity(struct pci_dev *dev, int nr)
                  nr >= entry->nvec_used))
             return NULL;

-        return &entry->affinity[nr];
+        return &entry->affinity[nr].mask;
     } else {
         return cpu_possible_mask;
     }
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index ca397ff40836..c44b7844dc83 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -257,6 +257,14 @@ struct irq_affinity {
     int *sets;
 };

+/**
+ * struct irq_affinity_desc - Interrupt affinity descriptor
+ *
Re: [PATCH 3/3] irq/affinity: Fix a possible breakage
Hi tglx,

On 2018/12/5 16:28, Thomas Gleixner wrote:
> On Tue, 4 Dec 2018, Dou Liyang wrote:
>> In case of irq_default_affinity != cpu_possible_mask, setting the
>> affinity for the pre/post vectors to irq_default_affinity is a
>> breakage.
>
> Why so? All interrupts which are not managed get the default affinity
> mask. It can be different than cpu_possible_mask, but that's what the
> admin has set. The affinity of these non-managed interrupts can still
> be set via /proc/... so where is the breakage?

I misunderstood it. Please ignore this. ;-)

Thanks,
	dou
[PATCH 2/3] irq/affinity: Add is_managed into struct irq_affinity_desc
Now, Linux uses irq_affinity_desc to convey the affinity information.

As Kashyap and Sumit reported, in the MSI(-X) subsystem the pre/post
vectors may be used as some extra reply queues for performance:

  https://marc.info/?l=linux-kernel&m=153543887027997&w=2

Their affinities are not NULL, but they should be mapped as unmanaged
interrupts. So transferring only the irq affinity assignments is not
enough.

Add a new bit "is_managed" to convey that information in
irq_affinity_desc and use it in alloc_descs().

Reported-by: Kashyap Desai
Reported-by: Sumit Saxena
Signed-off-by: Dou Liyang
---
 include/linux/interrupt.h |  1
 kernel/irq/affinity.c     |  7
 kernel/irq/irqdesc.c      |  9
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 71be303231e9..a12b3dbbc45e 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -263,6 +263,7 @@ struct irq_affinity {
  */
 struct irq_affinity_desc {
     struct cpumask  mask;
+    unsigned int    is_managed : 1;
 };

 #if defined(CONFIG_SMP)
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 1562a36e7c0f..d122575ba1b4 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -289,6 +289,13 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
     for (; curvec < nvecs; curvec++)
         cpumask_copy(&masks[curvec].mask, irq_default_affinity);

+    /* Setup complementary information */
+    for (i = 0; i < nvecs; i++) {
+        if (i >= affd->pre_vectors && i < nvecs - affd->post_vectors)
+            masks[i].is_managed = 1;
+        else
+            masks[i].is_managed = 0;
+    }
 outnodemsk:
     free_node_to_cpumask(node_to_cpumask);
     return masks;
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index f87fa2b9935a..6b0821c144c0 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -455,7 +455,7 @@ static int alloc_descs(unsigned int start, unsigned int cnt, int node,
     const struct irq_affinity_desc *cur_affinity = affinity;
     const struct cpumask *mask = NULL;
     struct irq_desc *desc;
-    unsigned int flags;
+    unsigned int flags = 0;
     int i;

     /* Validate affinity mask(s) */
@@ -468,11 +468,16 @@ static int alloc_descs(unsigned int start, unsigned int cnt, int node,
         }
     }

-    flags = affinity ? IRQD_AFFINITY_MANAGED | IRQD_MANAGED_SHUTDOWN : 0;
     mask = NULL;

     for (i = 0; i < cnt; i++) {
         if (affinity) {
+            if (affinity->is_managed) {
+                flags = IRQD_AFFINITY_MANAGED |
+                    IRQD_MANAGED_SHUTDOWN;
+            } else {
+                flags = 0;
+            }
             mask = &affinity->mask;
             node = cpu_to_node(cpumask_first(mask));
             affinity++;
--
2.17.2
[PATCH 3/3] irq/affinity: Fix a possible breakage
In case of irq_default_affinity != cpu_possible_mask, setting the
affinity for the pre/post vectors to irq_default_affinity is a breakage.

Just set the pre/post vectors to cpu_possible_mask and be done with it.

Suggested-by: Thomas Gleixner
Signed-off-by: Dou Liyang
---
 kernel/irq/affinity.c | 11
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index d122575ba1b4..aaa1dd82c3df 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -257,7 +257,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)

     /* Fill out vectors at the beginning that don't need affinity */
     for (curvec = 0; curvec < affd->pre_vectors; curvec++)
-        cpumask_copy(&masks[curvec].mask, irq_default_affinity);
+        cpumask_copy(&masks[curvec].mask, cpu_possible_mask);
     /*
      * Spread on present CPUs starting from affd->pre_vectors. If we
      * have multiple sets, build each sets affinity mask separately.
@@ -282,12 +282,15 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
     }

     /* Fill out vectors at the end that don't need affinity */
-    if (usedvecs >= affvecs)
+    if (usedvecs >= affvecs) {
         curvec = affd->pre_vectors + affvecs;
-    else
+    } else {
         curvec = affd->pre_vectors + usedvecs;
+        for (; curvec < affd->pre_vectors + affvecs; curvec++)
+            cpumask_copy(&masks[curvec].mask, irq_default_affinity);
+    }
     for (; curvec < nvecs; curvec++)
-        cpumask_copy(&masks[curvec].mask, irq_default_affinity);
+        cpumask_copy(&masks[curvec].mask, cpu_possible_mask);

     /* Setup complementary information */
     for (i = 0; i < nvecs; i++) {
--
2.17.2
[PATCH 1/3] genirq/core: Add a new interrupt affinity descriptor
Currently, Linux spreads the interrupt affinity info via a bare cpumask
pointer and marks an interrupt as managed if its cpumask is not NULL.

If some other info needs to be passed along, this design does not expand
well: adding new arguments is the most straightforward method, but it
would break many functions.

So, add a new interrupt affinity descriptor and replace the cpumask
pointer with a pointer to it, which allows expanding this in the future
without touching all the functions ever again; only the irq_affinity_desc
structure needs to be modified.

No functional change, just prepares for support of spreading managed
flags.

Suggested-by: Thomas Gleixner
Suggested-by: Bjorn Helgaas
Signed-off-by: Dou Liyang
---
 drivers/pci/msi.c         |  9
 include/linux/interrupt.h | 14
 include/linux/irq.h       |  6
 include/linux/irqdomain.h |  6
 include/linux/msi.h       |  4
 kernel/irq/affinity.c     | 22
 kernel/irq/devres.c       |  4
 kernel/irq/irqdesc.c      | 16
 kernel/irq/irqdomain.c    |  4
 kernel/irq/msi.c          |  7
 10 files changed, 56 insertions(+), 36 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 265ed3e4c920..7a1c8a09efa5 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -534,14 +534,13 @@ static int populate_msi_sysfs(struct pci_dev *pdev)
 static struct msi_desc *
 msi_setup_entry(struct pci_dev *dev, int nvec, const struct irq_affinity *affd)
 {
-    struct cpumask *masks = NULL;
+    struct irq_affinity_desc *masks = NULL;
     struct msi_desc *entry;
     u16 control;

     if (affd)
         masks = irq_create_affinity_masks(nvec, affd);
-
     /* MSI Entry Initialization */
     entry = alloc_msi_entry(&dev->dev, nvec, masks);
     if (!entry)
@@ -672,7 +671,7 @@ static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
                   struct msix_entry *entries, int nvec,
                   const struct irq_affinity *affd)
 {
-    struct cpumask *curmsk, *masks = NULL;
+    struct irq_affinity_desc *curmsk, *masks = NULL;
     struct msi_desc *entry;
     int ret, i;
@@ -1264,7 +1263,7 @@ const struct cpumask *pci_irq_get_affinity(struct pci_dev *dev, int nr)
         for_each_pci_msi_entry(entry, dev) {
             if (i == nr)
-                return entry->affinity;
+                return &entry->affinity->mask;
             i++;
         }
         WARN_ON_ONCE(1);
@@ -1276,7 +1275,7 @@ const struct cpumask *pci_irq_get_affinity(struct pci_dev *dev, int nr)
                  nr >= entry->nvec_used))
             return NULL;

-        return &entry->affinity[nr];
+        return &entry->affinity[nr].mask;
     } else {
         return cpu_possible_mask;
     }
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index ca397ff40836..71be303231e9 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -257,6 +257,14 @@ struct irq_affinity {
     int *sets;
 };

+/**
+ * struct irq_affinity_desc - Interrupt affinity descriptor
+ * @mask: It's one cpumask per descriptor.
+ */
+struct irq_affinity_desc {
+    struct cpumask  mask;
+};
+
 #if defined(CONFIG_SMP)

 extern cpumask_var_t irq_default_affinity;
@@ -303,7 +311,9 @@ extern int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m);
 extern int
 irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify);

-struct cpumask *irq_create_affinity_masks(int nvec, const struct irq_affinity *affd);
+struct irq_affinity_desc *
+irq_create_affinity_masks(int nvec, const struct irq_affinity *affd);
+
 int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity *affd);

 #else /* CONFIG_SMP */
@@ -337,7 +347,7 @@ irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify)
     return 0;
 }

-static inline struct cpumask *
+static inline struct irq_affinity_desc *
 irq_create_affinity_masks(int nvec, const struct irq_affinity *affd)
 {
     return NULL;
diff --git a/include/linux/irq.h b/include/linux/irq.h
index c9bffda04a45..def2b2aac8b1 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -27,6 +27,7 @@
 struct seq_file;
 struct module;
 struct msi_msg;
+struct irq_affinity_desc;
 enum irqchip_irq_state;

 /*
@@ -834,11 +835,12 @@ struct cpumask *irq_data_get_effective_affinity_mask(struct irq_data *d)
 unsigned int arch_dynirq_lower_bound(unsigned int from);

 int __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
-              struct module *owner, const struct cpumask *affinity);
+              struct module *owner,
+              const struct irq_affinity_desc *af
[PATCH 0/3] irq/core: Fix and expand the irq affinity descriptor
Now, spreading the interrupt affinity info by a cpumask pointer is not
enough: it runs into a problem[1] and is hard to expand in the future.
Fix it by:

  +----------------------------------+
  |     struct cpumask *affinity     |
  +----------------+-----------------+
                   |
  +----------------v-----------------+
  | struct irq_affinity_desc {       |
  |     struct cpumask mask;         |
  |     unsigned int is_managed : 1; |
  | };                               |
  +----------------------------------+

[1]: https://marc.info/?l=linux-kernel&m=153543887027997&w=2

Dou Liyang (3):
  genirq/affinity: Add a new interrupt affinity descriptor
  irq/affinity: Add is_managed into struct irq_affinity_desc
  irq/affinity: Fix a possible breakage

 drivers/pci/msi.c         |  9
 include/linux/interrupt.h | 15
 include/linux/irq.h       |  6
 include/linux/irqdomain.h |  6
 include/linux/msi.h       |  4
 kernel/irq/affinity.c     | 38
 kernel/irq/devres.c       |  4
 kernel/irq/irqdesc.c      | 25
 kernel/irq/irqdomain.c    |  4
 kernel/irq/msi.c          |  7
 10 files changed, 77 insertions(+), 41 deletions(-)

--
2.17.2
Re: [RFC PATCH v3] genirq/affinity: Create and transfer more irq desc info by a new structure
Hi Thomas,

On 2018/11/29 6:03, Thomas Gleixner wrote:
>> +	affi_desc = kcalloc(nvec, sizeof(*affi_desc), GFP_KERNEL);
>
> Why do you want to do that separate allocation here? Just let

I thought irq_create_affinity_desc() could also be called by other
functions which may convert cpumasks to irq_affinity_desc, such as
__devm_irq_alloc_descs(). Now I know I was wrong, will modify it.

> irq_create_affinity_masks() allocate an array of affinity descriptors
> and use that. There is no point in copying that stuff over and over.
> Setting the flag field can be done in the existing function as well.
>
> Can you please change the function signature and fixup the callers, if
> there are any of them? Copying this over and over is horrible.

I have searched; no one calls __devm_irq_alloc_descs. It may be called
by some users' own modules or drivers. Yes, I will change it.

> 	struct irq_affinity_desc {
> 		struct cpumask	masks;
> 		unsigned int	managed : 1;
> 	};

Yes. BTW, perhaps the following is a better fit for irq_affinity_desc:

  s/masks/mask/
  s/managed/is_managed/

> You can spare a lot of pointless churn by just keeping the 'affinity'
> name and only changing the struct type. The compiler will catch all
> places which need to be fixed and 'affinity' is generic enough to be
> used with the new struct type as well. As Bjorn said, even 'masks' is
> fine.

Yes, I see.

Thanks,
	dou
Re: [RFC PATCH v3] genirq/affinity: Create and transfer more irq desc info by a new structure
Hi Bjorn,

On 2018/11/29 4:00, Bjorn Helgaas wrote:
> [+cc linux-pci]
>
> Since you mention reports, are there URLs to mailing list archives you
> can include?

OK, I will add it:

  https://marc.info/?l=linux-kernel&m=153543887027997&w=2

>> -	entry = alloc_msi_entry(&dev->dev, nvec, masks);
>> +	entry = alloc_msi_entry(&dev->dev, nvec, affi_desc);
>
> Can you split this into two or more patches? Most of these changes

Yes, the next non-RFC version will do.

> are trivial and not very interesting, and the fact that they're all in
> one patch makes it hard to find and review the interesting bits. For
> example,
>
> 1) Rename all the local variables while keeping the type the same (or
>    just leave the name the same; I think "affinity" would be a fine
>    name, and I would be OK if we ended up with "struct
>    irq_affinity_desc *masks" or "struct irq_affinity_desc *affinity").
>    This patch would obviously have no functional impact and would
>    remove a lot of the trivial changes.

Oops, how stupid I am!

> 2) Add "struct irq_affinity_desc" containing only "struct cpumask
>    masks" and irq_create_affinity_desc() (or leave the name as
>    irq_create_affinity_masks() and adjust the signature). This would
>    also have no functional impact and would be a fairly trivial patch.
>
> 3) Add "flags" to struct irq_affinity_desc and the related code. This
>    is the real meat of your patch, and with the above out of the way,
>    it will be much smaller and it'll be obvious what the important
>    changes are.

I see. :)

Thanks,
	dou
[RFC PATCH v3] genirq/affinity: Create and transfer more irq desc info by a new structure
Now, Linux just spreads the irq affinity to the irqdesc core by a cpumask
pointer. If a vector's affinity is not NULL, it will be marked as managed.

But, as Kashyap and Sumit reported, in the MSI(-X) subsystem the pre/post
vectors may be used as some extra reply queues for performance. Their
affinities are not NULL, but they should be mapped as unmanaged
interrupts. So transferring only the irq affinity assignments is not
enough.

Create a new structure named irq_affinity_desc, which includes both the
irq affinity masks and flags. Replace the cpumask pointer with an
irq_affinity_desc pointer, which allows expanding this in the future
without touching all the functions ever again; just modify the
irq_affinity_desc structure.

Reported-by: Kashyap Desai
Reported-by: Sumit Saxena
Suggested-by: Thomas Gleixner
Signed-off-by: Dou Liyang
---
Changelog:
v2 --> v3:
  - Create a new irq_affinity_desc pointer to transfer the info,
    suggested by tglx
  - Rebase to the tip irq/core branch
v1 --> v2:
  - Add a bitmap for marking if an interrupt is managed or not;
    the size of the bitmap is allocated at runtime.
  - Needs more tests; this patch was only tested in QEMU.
- v1: https://lkml.org/lkml/2018/9/13/366

 drivers/pci/msi.c         | 29
 include/linux/interrupt.h | 19
 include/linux/irq.h       |  3
 include/linux/irqdomain.h |  7
 include/linux/msi.h       |  4
 kernel/irq/affinity.c     | 40
 kernel/irq/devres.c       | 23
 kernel/irq/irqdesc.c      | 32
 kernel/irq/irqdomain.c    | 14
 kernel/irq/msi.c          | 21
 10 files changed, 135 insertions(+), 57 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 265ed3e4c920..431449163316 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -534,16 +534,15 @@ static int populate_msi_sysfs(struct pci_dev *pdev)
 static struct msi_desc *
 msi_setup_entry(struct pci_dev *dev, int nvec, const struct irq_affinity *affd)
 {
-    struct cpumask *masks = NULL;
+    struct irq_affinity_desc *affi_desc = NULL;
     struct msi_desc *entry;
     u16 control;

     if (affd)
-        masks = irq_create_affinity_masks(nvec, affd);
-
+        affi_desc = irq_create_affinity_desc(nvec, affd);
     /* MSI Entry Initialization */
-    entry = alloc_msi_entry(&dev->dev, nvec, masks);
+    entry = alloc_msi_entry(&dev->dev, nvec, affi_desc);
     if (!entry)
         goto out;
@@ -567,7 +566,7 @@ msi_setup_entry(struct pci_dev *dev, int nvec, const struct irq_affinity *affd)
         pci_read_config_dword(dev, entry->mask_pos, &entry->masked);

 out:
-    kfree(masks);
+    kfree(affi_desc);
     return entry;
 }
@@ -672,15 +671,15 @@ static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
                   struct msix_entry *entries, int nvec,
                   const struct irq_affinity *affd)
 {
-    struct cpumask *curmsk, *masks = NULL;
+    struct irq_affinity_desc *cur_affi_desc, *affi_desc = NULL;
     struct msi_desc *entry;
     int ret, i;

     if (affd)
-        masks = irq_create_affinity_masks(nvec, affd);
+        affi_desc = irq_create_affinity_desc(nvec, affd);

-    for (i = 0, curmsk = masks; i < nvec; i++) {
-        entry = alloc_msi_entry(&dev->dev, 1, curmsk);
+    for (i = 0, cur_affi_desc = affi_desc; i < nvec; i++) {
+        entry = alloc_msi_entry(&dev->dev, 1, cur_affi_desc);
         if (!entry) {
             if (!i)
                 iounmap(base);
@@ -701, +700,12 @@ static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
         entry->mask_base = base;

         list_add_tail(&entry->list, dev_to_msi_list(&dev->dev));
-        if (masks)
-            curmsk++;
+        if (affi_desc)
+            cur_affi_desc++;
     }
     ret = 0;
 out:
-    kfree(masks);
+    kfree(affi_desc);
     return ret;
 }
@@ -1264,7 +1263,7 @@ const struct cpumask *pci_irq_get_affinity(struct pci_dev *dev, int nr)
         for_each_pci_msi_entry(entry, dev) {
             if (i == nr)
-                return entry->affinity;
+                return &entry->affi_desc->masks;
             i++;
         }
         WARN_ON_ONCE(1);
@@ -1272,11 +1271,11 @@ const struct cpumask *pci_irq_get_affinity(struct pci_dev *dev, int nr)
     } else if (dev->msi_enabled) {
         struct msi_desc *entry = first_pci_msi_entry(dev);

-        if (WARN_ON_ONCE(!entry || !entry->affinity ||
+
[RFC PATCH v2] gen/irq: Change the way to differentiate between managed and unmanaged interrupts by bitmap
As Kashyap and Sumit reported, in the MSI(-X) subsystem the pre/post
vectors may be used as some extra reply queues for performance. The best
way to map the pre/post vectors is to map them to unmanaged interrupts.

But current Linux can't do that, because the way we differentiate between
managed and unmanaged interrupts is the affinity mask: if the affinity
mask is present, the interrupt will be marked as managed. Since the pre-
and post-vectors' affinity masks are set when the kernel boots up, they
are always managed interrupts like the other MSI(-X) vectors. In other
words, Linux can't both set the affinity mask and clear the managed flag.

So, decouple the managed flag from the affinity mask and add a new bitmap
to differentiate between managed and unmanaged interrupts.

Reported-by: Kashyap Desai
Reported-by: Sumit Saxena
Suggested-by: Thomas Gleixner
Signed-off-by: Dou Liyang
---
Changelog:
  - Add a bitmap for marking if an interrupt is managed or not;
    the size of the bitmap is allocated at runtime.
  - Needs more tests; this patch was only tested in QEMU.
- v1: https://lkml.org/lkml/2018/9/13/366

 arch/sparc/kernel/irq_64.c      |  2
 arch/x86/kernel/apic/io_apic.c  |  4
 drivers/base/platform-msi.c     |  2
 drivers/bus/fsl-mc/fsl-mc-msi.c |  2
 drivers/irqchip/irq-gic-v4.c    |  2
 drivers/pci/msi.c               | 12
 include/linux/interrupt.h       |  5
 include/linux/irq.h             |  4
 include/linux/irqdomain.h       |  6
 include/linux/msi.h             |  3
 kernel/irq/affinity.c           | 27
 kernel/irq/devres.c             | 13
 kernel/irq/ipi.c                |  4
 kernel/irq/irqdesc.c            | 15
 kernel/irq/irqdomain.c          | 15
 kernel/irq/msi.c                |  7
 16 files changed, 83 insertions(+), 40 deletions(-)

diff --git a/arch/sparc/kernel/irq_64.c b/arch/sparc/kernel/irq_64.c
index 713670e6d13d..1ce85779b2bb 100644
--- a/arch/sparc/kernel/irq_64.c
+++ b/arch/sparc/kernel/irq_64.c
@@ -242,7 +242,7 @@ unsigned int irq_alloc(unsigned int dev_handle, unsigned int dev_ino)
 {
     int irq;

-    irq = __irq_alloc_descs(-1, 1, 1, numa_node_id(), NULL, NULL);
+    irq = __irq_alloc_descs(-1, 1, 1, numa_node_id(), NULL, NULL, NULL);
     if (irq <= 0)
         goto out;
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 2953bbf05c08..2e7c8b77874a 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -982,7 +982,7 @@ static int alloc_irq_from_domain(struct irq_domain *domain, int ioapic, u32 gsi,

     return __irq_domain_alloc_irqs(domain, irq, 1,
                        ioapic_alloc_attr_node(info),
-                       info, legacy, NULL);
+                       info, legacy, NULL, NULL);
 }

 /*
@@ -1017,7 +1017,7 @@ static int alloc_isa_irq_from_domain(struct irq_domain *domain,
     } else {
         info->flags |= X86_IRQ_ALLOC_LEGACY;
         irq = __irq_domain_alloc_irqs(domain, irq, 1, node, info, true,
-                          NULL);
+                          NULL, NULL);
         if (irq >= 0) {
             irq_data = irq_domain_get_irq_data(domain, irq);
             data = irq_data->chip_data;
diff --git a/drivers/base/platform-msi.c b/drivers/base/platform-msi.c
index f39a920496fb..c5d319d92eb1 100644
--- a/drivers/base/platform-msi.c
+++ b/drivers/base/platform-msi.c
@@ -134,7 +134,7 @@ static int platform_msi_alloc_descs_with_irq(struct device *dev, int virq,
     }

     for (i = 0; i < nvec; i++) {
-        desc = alloc_msi_entry(dev, 1, NULL);
+        desc = alloc_msi_entry(dev, 1, NULL, 0);
         if (!desc)
             break;
diff --git a/drivers/bus/fsl-mc/fsl-mc-msi.c b/drivers/bus/fsl-mc/fsl-mc-msi.c
index 8b9c66d7c4ff..92e6266e10d2 100644
--- a/drivers/bus/fsl-mc/fsl-mc-msi.c
+++ b/drivers/bus/fsl-mc/fsl-mc-msi.c
@@ -214,7 +214,7 @@ static int fsl_mc_msi_alloc_descs(struct device *dev, unsigned int irq_count)
     struct msi_desc *msi_desc;

     for (i = 0; i < irq_count; i++) {
-        msi_desc = alloc_msi_entry(dev, 1, NULL);
+        msi_desc = alloc_msi_entry(dev, 1, NULL, 0);
         if (!msi_desc) {
             dev_err(dev, "Failed to allocate msi entry\n");
             error = -ENOMEM;
diff --git a/drivers/irqchip/irq-gic-v4.c b/drivers/irqchip/irq-gic-v4.c
index dba9d67cb9c1..17b459e283a1 100644
--- a/drivers/irqchip/irq-gic-v4.c
+++ b/drivers/irqchip/irq-gic-v4.c
@@ -119,7 +119,7 @@ int its_alloc_vcpu_irqs(struct its_vm *vm)

     vpe_base_irq = __irq_domain_alloc_irqs(vm->domain, -1, vm->nr_vpes,
Re: [RFC PATCH] irq/affinity: Mark the pre/post vectors as regular interrupts
Hi Kashyap,

On 2018/9/20 16:39, Kashyap Desai wrote:
> Dou - Do you want me to test your patch or shall I wait for the next
> version?

I'm sorry, please wait for the next version.

Thanks,
	dou
[tip:x86/apic] irq/matrix: Spread managed interrupts on allocation
Commit-ID:  76f99ae5b54d48430d1f0c5512a84da0ff9761e0
Gitweb:     https://git.kernel.org/tip/76f99ae5b54d48430d1f0c5512a84da0ff9761e0
Author:     Dou Liyang
AuthorDate: Sun, 9 Sep 2018 01:58:38 +0800
Committer:  Thomas Gleixner
CommitDate: Tue, 18 Sep 2018 18:27:24 +0200

irq/matrix: Spread managed interrupts on allocation

Linux spreads out the non managed interrupts across the possible target
CPUs to avoid vector space exhaustion.

Managed interrupts are treated differently, as for them the vectors are
reserved (with guarantee) when the interrupt descriptors are initialized.
When the interrupt is requested a real vector is assigned. The assignment
logic uses the first CPU in the affinity mask for assignment. If the
interrupt has more than one CPU in the affinity mask, which happens when a
multi queue device has less queues than CPUs, then doing the same search
as for non managed interrupts makes sense as it puts the interrupt on the
least interrupt plagued CPU. For single CPU affine vectors that's
obviously a NOOP.

Restructure the matrix allocation code so it does the 'best CPU' search,
add the sanity check for an empty affinity mask and adapt the call site in
the x86 vector management code.
[ tglx: Added the empty mask check to the core and improved change log ] Signed-off-by: Dou Liyang Signed-off-by: Thomas Gleixner Cc: h...@zytor.com Link: https://lkml.kernel.org/r/20180908175838.14450-2-dou_liy...@163.com --- arch/x86/kernel/apic/vector.c | 9 - include/linux/irq.h | 3 ++- kernel/irq/matrix.c | 17 ++--- 3 files changed, 20 insertions(+), 9 deletions(-) diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c index 7654febd5102..652e7ffa9b9d 100644 --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -313,14 +313,13 @@ assign_managed_vector(struct irq_data *irqd, const struct cpumask *dest) struct apic_chip_data *apicd = apic_chip_data(irqd); int vector, cpu; - cpumask_and(vector_searchmask, vector_searchmask, affmsk); - cpu = cpumask_first(vector_searchmask); - if (cpu >= nr_cpu_ids) - return -EINVAL; + cpumask_and(vector_searchmask, dest, affmsk); + /* set_affinity might call here for nothing */ if (apicd->vector && cpumask_test_cpu(apicd->cpu, vector_searchmask)) return 0; - vector = irq_matrix_alloc_managed(vector_matrix, cpu); + vector = irq_matrix_alloc_managed(vector_matrix, vector_searchmask, + &cpu); trace_vector_alloc_managed(irqd->irq, vector, vector); if (vector < 0) return vector; diff --git a/include/linux/irq.h b/include/linux/irq.h index 201de12a9957..c9bffda04a45 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -1151,7 +1151,8 @@ void irq_matrix_offline(struct irq_matrix *m); void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit, bool replace); int irq_matrix_reserve_managed(struct irq_matrix *m, const struct cpumask *msk); void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask *msk); -int irq_matrix_alloc_managed(struct irq_matrix *m, unsigned int cpu); +int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk, + unsigned int *mapped_cpu); void irq_matrix_reserve(struct irq_matrix *m); void irq_matrix_remove_reserved(struct 
irq_matrix *m); int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk, diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c index 67768bbe736e..6e6d467f3dec 100644 --- a/kernel/irq/matrix.c +++ b/kernel/irq/matrix.c @@ -260,11 +260,21 @@ void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask *msk) * @m: Matrix pointer * @cpu: On which CPU the interrupt should be allocated */ -int irq_matrix_alloc_managed(struct irq_matrix *m, unsigned int cpu) +int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk, +unsigned int *mapped_cpu) { - struct cpumap *cm = per_cpu_ptr(m->maps, cpu); - unsigned int bit, end = m->alloc_end; + unsigned int bit, cpu, end = m->alloc_end; + struct cpumap *cm; + + if (cpumask_empty(msk)) + return -EINVAL; + cpu = matrix_find_best_cpu(m, msk); + if (cpu == UINT_MAX) + return -ENOSPC; + + cm = per_cpu_ptr(m->maps, cpu); + end = m->alloc_end; /* Get managed bit which are not allocated */ bitmap_andnot(m->scratch_map, cm->managed_map, cm->alloc_map, end); bit = find_first_bit(m->scratch_map, end); @@ -273,6 +283,7 @@ int irq_matrix_alloc_managed(struct irq_matrix *m, unsigned int cpu) set_bit(bit, cm->alloc_map); cm->allocated++; m->total_allocated++; +
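The allocation policy in the commit above can be modeled in plain userspace C. This is a hypothetical sketch, not kernel code: the cpumask is a plain bitmask, the per-CPU maps are a small array, `__builtin_ctzl()` stands in for `find_first_bit()`, and `-ENOSPC` is collapsed to `-1`. Only the logic mirrors the patch: pick the online CPU in the mask with the most available vectors, then grab the first managed bit that is not yet allocated.

```c
#include <limits.h>
#include <stdbool.h>

#define NCPUS 4

struct cpumap {
    bool online;
    unsigned int available;    /* free vectors on this CPU */
    unsigned long managed_map; /* bits reserved for managed interrupts */
    unsigned long alloc_map;   /* bits already handed out */
};

static struct cpumap maps[NCPUS];

/* Mirrors matrix_find_best_cpu(): "lowest vector allocation count"
 * means "highest available". Returns UINT_MAX when no online CPU in
 * the mask qualifies. */
static unsigned int find_best_cpu(unsigned long cpumask)
{
    unsigned int cpu, best = UINT_MAX, maxavl = 0;

    for (cpu = 0; cpu < NCPUS; cpu++) {
        if (!(cpumask & (1UL << cpu)))
            continue;
        if (!maps[cpu].online || maps[cpu].available <= maxavl)
            continue;
        best = cpu;
        maxavl = maps[cpu].available;
    }
    return best;
}

/* Mirrors irq_matrix_alloc_managed() after the patch: search the whole
 * mask for the best CPU instead of blindly taking its first CPU. */
static int alloc_managed(unsigned long cpumask, unsigned int *mapped_cpu)
{
    unsigned int cpu = find_best_cpu(cpumask);
    struct cpumap *cm;
    unsigned long freebits;
    int bit;

    if (cpu == UINT_MAX)
        return -1; /* -ENOSPC in the kernel */

    cm = &maps[cpu];
    freebits = cm->managed_map & ~cm->alloc_map; /* managed, not allocated */
    if (!freebits)
        return -1;
    bit = __builtin_ctzl(freebits); /* first free managed bit */
    cm->alloc_map |= 1UL << bit;
    cm->available--;
    *mapped_cpu = cpu;
    return bit;
}
```

With two online CPUs that both have managed bits reserved, the allocation lands on the one with more headroom, which is exactly the spreading effect the changelog describes.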
[tip:x86/apic] irq/matrix: Split out the CPU selection code into a helper
Commit-ID: 8ffe4e61c06a48324cfd97f1199bb9838acce2f2 Gitweb: https://git.kernel.org/tip/8ffe4e61c06a48324cfd97f1199bb9838acce2f2 Author: Dou Liyang AuthorDate: Sun, 9 Sep 2018 01:58:37 +0800 Committer: Thomas Gleixner CommitDate: Tue, 18 Sep 2018 18:27:24 +0200 irq/matrix: Split out the CPU selection code into a helper Linux finds the CPU which has the lowest vector allocation count to spread out the non managed interrupts across the possible target CPUs, but does not do so for managed interrupts. Split out the CPU selection code into a helper function for reuse. No functional change. Signed-off-by: Dou Liyang Signed-off-by: Thomas Gleixner Cc: h...@zytor.com Link: https://lkml.kernel.org/r/20180908175838.14450-1-dou_liy...@163.com --- kernel/irq/matrix.c | 65 +++-- 1 file changed, 38 insertions(+), 27 deletions(-) diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c index 5092494bf261..67768bbe736e 100644 --- a/kernel/irq/matrix.c +++ b/kernel/irq/matrix.c @@ -124,6 +124,27 @@ static unsigned int matrix_alloc_area(struct irq_matrix *m, struct cpumap *cm, return area; } +/* Find the best CPU which has the lowest vector allocation count */ +static unsigned int matrix_find_best_cpu(struct irq_matrix *m, + const struct cpumask *msk) +{ + unsigned int cpu, best_cpu, maxavl = 0; + struct cpumap *cm; + + best_cpu = UINT_MAX; + + for_each_cpu(cpu, msk) { + cm = per_cpu_ptr(m->maps, cpu); + + if (!cm->online || cm->available <= maxavl) + continue; + + best_cpu = cpu; + maxavl = cm->available; + } + return best_cpu; +} + /** * irq_matrix_assign_system - Assign system wide entry in the matrix * @m: Matrix pointer @@ -322,37 +343,27 @@ void irq_matrix_remove_reserved(struct irq_matrix *m) int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk, bool reserved, unsigned int *mapped_cpu) { - unsigned int cpu, best_cpu, maxavl = 0; + unsigned int cpu, bit; struct cpumap *cm; - unsigned int bit; - best_cpu = UINT_MAX; - for_each_cpu(cpu, msk) { - cm = 
per_cpu_ptr(m->maps, cpu); - - if (!cm->online || cm->available <= maxavl) - continue; + cpu = matrix_find_best_cpu(m, msk); + if (cpu == UINT_MAX) + return -ENOSPC; - best_cpu = cpu; - maxavl = cm->available; - } + cm = per_cpu_ptr(m->maps, cpu); + bit = matrix_alloc_area(m, cm, 1, false); + if (bit >= m->alloc_end) + return -ENOSPC; + cm->allocated++; + cm->available--; + m->total_allocated++; + m->global_available--; + if (reserved) + m->global_reserved--; + *mapped_cpu = cpu; + trace_irq_matrix_alloc(bit, cpu, m, cm); + return bit; - if (maxavl) { - cm = per_cpu_ptr(m->maps, best_cpu); - bit = matrix_alloc_area(m, cm, 1, false); - if (bit < m->alloc_end) { - cm->allocated++; - cm->available--; - m->total_allocated++; - m->global_available--; - if (reserved) - m->global_reserved--; - *mapped_cpu = best_cpu; - trace_irq_matrix_alloc(bit, best_cpu, m, cm); - return bit; - } - } - return -ENOSPC; } /**
Re: [PATCH v3 2/2] irq/matrix: Spread managed interrupts on allocation
Dear Thomas, On 2018/9/17 23:32, Thomas Gleixner wrote: [...] I think it's still worthwhile to do that, but the changelog needs a major overhaul as right now it's outright misleading. I'll just amend it with something along the above lines, unless someone disagrees. Yes, right, I was wrong: the change can't prevent vector exhaustion, it just balances the allocations as you said. Thank you so much. That said, it might also be interesting to allow user space affinity settings on managed interrupts. Not meant for the pre/post vector case, which just needs to be made non managed. It's meant for the case where a Yes, I am working in the direction you are proposing. Recently I changed my PC and have just finished setting up the environment. ;-) Thanks, dou
Re: [PATCH] sched/core: Fix compile warning in SMP=n case
At 08/10/2018 10:35 AM, Dou Liyang wrote: When compiling the kernel with SMP disabled, the build warns with: kernel/sched/core.c: In function ‘update_rq_clock_task’: kernel/sched/core.c:139:17: warning: unused variable ‘irq_delta’ [-Wunused-variable] s64 steal = 0, irq_delta = 0; Fix this by reverting the HAVE_SCHED_AVG_IRQ guard to defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING) Fixes: 2e62c4743adc ("sched/fair: Remove #ifdefs from scale_rt_capacity()") Hi, Miguel also found this warning. Can we pick it up now? ;-) Thanks, dou
[RFC PATCH] irq/affinity: Mark the pre/post vectors as regular interrupts
From: Dou Liyang As Kashyap and Sumit reported, in the MSI/MSI-X subsystem the pre/post vectors may be used as extra reply queues for performance. The best way to map the pre/post vectors is to map them to the local NUMA node. But current Linux can't do that, because the pre and post vectors are marked managed and their affinity mask is set to the irq default affinity mask. The default affinity mask is by default ALL cpus, but it can be tweaked both on the kernel command line and via proc. If that mask is only a subset of CPUs and all of them go offline, then these vectors are shut down in managed mode. So, clear these affinity masks and check for that in alloc_descs() to leave them as regular interrupts which can be affinity controlled and also can move freely on hotplug. Note: this will break the validation of affinity mask(s) Reported-by: Kashyap Desai Reported-by: Sumit Saxena Suggested-by: Thomas Gleixner Signed-off-by: Dou Liyang --- kernel/irq/affinity.c | 9 ++--- kernel/irq/irqdesc.c | 24 ++-- 2 files changed, 16 insertions(+), 17 deletions(-) diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c index f4f29b9d90ee..ba35a5050dd2 100644 --- a/kernel/irq/affinity.c +++ b/kernel/irq/affinity.c @@ -204,7 +204,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) /* Fill out vectors at the beginning that don't need affinity */ for (curvec = 0; curvec < affd->pre_vectors; curvec++) - cpumask_copy(masks + curvec, irq_default_affinity); + cpumask_clear(masks + curvec); /* Stabilize the cpumasks */ get_online_cpus(); @@ -234,10 +234,13 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) /* Fill out vectors at the end that don't need affinity */ if (usedvecs >= affvecs) curvec = affd->pre_vectors + affvecs; - else + else { curvec = affd->pre_vectors + usedvecs; + for (; curvec < affd->pre_vectors + affvecs; curvec++) + cpumask_copy(masks + curvec, irq_default_affinity); + } for (; curvec < nvecs; curvec++) - cpumask_copy(masks + curvec, 
irq_default_affinity); + cpumask_clear(masks + curvec); outnodemsk: free_node_to_cpumask(node_to_cpumask); diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c index 578d0e5f1b5b..5cffa791a20b 100644 --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -453,24 +453,20 @@ static int alloc_descs(unsigned int start, unsigned int cnt, int node, { const struct cpumask *mask = NULL; struct irq_desc *desc; - unsigned int flags; + unsigned int flags = 0; int i; - /* Validate affinity mask(s) */ - if (affinity) { - for (i = 0, mask = affinity; i < cnt; i++, mask++) { - if (cpumask_empty(mask)) - return -EINVAL; - } - } - - flags = affinity ? IRQD_AFFINITY_MANAGED | IRQD_MANAGED_SHUTDOWN : 0; - mask = NULL; - for (i = 0; i < cnt; i++) { if (affinity) { - node = cpu_to_node(cpumask_first(affinity)); - mask = affinity; + if (cpumask_empty(affinity)) { + flags = 0; + mask = NULL; + } else { + flags = IRQD_AFFINITY_MANAGED | + IRQD_MANAGED_SHUTDOWN; + mask = affinity; + node = cpu_to_node(cpumask_first(affinity)); + } affinity++; } desc = alloc_desc(start + i, node, flags, mask, owner); -- 2.14.3
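The behaviour the RFC above proposes can be sketched in userspace C. This is a hypothetical model, not the kernel implementation: a cpumask is a plain `unsigned long`, the spread step is simplified to round-robin over four pretend CPUs, and the `is_managed()` helper stands in for the check the RFC adds to `alloc_descs()`. What it shows is the mask layout the patch produces: pre/post vectors get an empty mask (so they become regular, user-settable interrupts) while only the queue vectors in the middle carry a managed affinity mask.

```c
#include <stdbool.h>

struct affd { int pre_vectors, post_vectors; };

/* masks[i] == 0 -> regular interrupt; nonzero -> managed, spread mask.
 * Simplified model of irq_create_affinity_masks() per the RFC. */
static void create_affinity_masks(unsigned long *masks, int nvecs,
                                  const struct affd *affd)
{
    int i, queues_end = nvecs - affd->post_vectors;

    /* Pre vectors: cleared instead of copying irq_default_affinity. */
    for (i = 0; i < affd->pre_vectors; i++)
        masks[i] = 0;
    /* Queue vectors: spread round-robin over 4 pretend CPUs. */
    for (; i < queues_end; i++)
        masks[i] = 1UL << ((i - affd->pre_vectors) % 4);
    /* Post vectors: cleared as well. */
    for (; i < nvecs; i++)
        masks[i] = 0;
}

/* Mirrors the alloc_descs() side of the RFC: an empty mask means
 * "regular interrupt", a non-empty one means "managed". */
static bool is_managed(unsigned long mask)
{
    return mask != 0;
}
```

With one pre and one post vector out of four, only the two middle queue vectors end up managed, which is the outcome the RFC (and the later `is_managed` bit in `struct irq_affinity_desc`) aims for.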
[PATCH v3 2/2] irq/matrix: Spread managed interrupts on allocation
From: Dou Liyang Linux has spread out the non managed interrupt across the possible target CPUs to avoid vector space exhaustion. But, the same situation may happen on the managed interrupts. Spread managed interrupt on allocation as well. Note: also change the return value for the empty search mask case from EINVAL to ENOSPC. Signed-off-by: Dou Liyang --- Changelog v3 --> v2 - Mention the changes in the changelog suggested by tglx - Use the new matrix_find_best_cpu() helper arch/x86/kernel/apic/vector.c | 8 +++- include/linux/irq.h | 3 ++- kernel/irq/matrix.c | 14 +++--- 3 files changed, 16 insertions(+), 9 deletions(-) diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c index 9f148e3d45b4..b7fc290b4b98 100644 --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -313,14 +313,12 @@ assign_managed_vector(struct irq_data *irqd, const struct cpumask *dest) struct apic_chip_data *apicd = apic_chip_data(irqd); int vector, cpu; - cpumask_and(vector_searchmask, vector_searchmask, affmsk); - cpu = cpumask_first(vector_searchmask); - if (cpu >= nr_cpu_ids) - return -EINVAL; + cpumask_and(vector_searchmask, dest, affmsk); + /* set_affinity might call here for nothing */ if (apicd->vector && cpumask_test_cpu(apicd->cpu, vector_searchmask)) return 0; - vector = irq_matrix_alloc_managed(vector_matrix, cpu); + vector = irq_matrix_alloc_managed(vector_matrix, vector_searchmask, &cpu); trace_vector_alloc_managed(irqd->irq, vector, vector); if (vector < 0) return vector; diff --git a/include/linux/irq.h b/include/linux/irq.h index 201de12a9957..c9bffda04a45 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -1151,7 +1151,8 @@ void irq_matrix_offline(struct irq_matrix *m); void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit, bool replace); int irq_matrix_reserve_managed(struct irq_matrix *m, const struct cpumask *msk); void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask *msk); -int 
irq_matrix_alloc_managed(struct irq_matrix *m, unsigned int cpu); +int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk, + unsigned int *mapped_cpu); void irq_matrix_reserve(struct irq_matrix *m); void irq_matrix_remove_reserved(struct irq_matrix *m); int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk, diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c index 67768bbe736e..34f97c4f10d7 100644 --- a/kernel/irq/matrix.c +++ b/kernel/irq/matrix.c @@ -260,11 +260,18 @@ void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask *msk) * @m: Matrix pointer * @cpu: On which CPU the interrupt should be allocated */ -int irq_matrix_alloc_managed(struct irq_matrix *m, unsigned int cpu) +int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk, + unsigned int *mapped_cpu) { - struct cpumap *cm = per_cpu_ptr(m->maps, cpu); - unsigned int bit, end = m->alloc_end; + unsigned int bit, cpu, end = m->alloc_end; + struct cpumap *cm; + + cpu = matrix_find_best_cpu(m, msk); + if (cpu == UINT_MAX) + return -ENOSPC; + cm = per_cpu_ptr(m->maps, cpu); + end = m->alloc_end; /* Get managed bit which are not allocated */ bitmap_andnot(m->scratch_map, cm->managed_map, cm->alloc_map, end); bit = find_first_bit(m->scratch_map, end); @@ -273,6 +280,7 @@ int irq_matrix_alloc_managed(struct irq_matrix *m, unsigned int cpu) set_bit(bit, cm->alloc_map); cm->allocated++; m->total_allocated++; + *mapped_cpu = cpu; trace_irq_matrix_alloc_managed(bit, cpu, m, cm); return bit; } -- 2.14.3
[PATCH v3 1/2] irq/matrix: Split out the CPU finding code into a helper
From: Dou Liyang Linux finds the CPU which has the lowest vector allocation count to spread out the non managed interrupt across the possible target CPUs. This common CPU finding code will also be used in managed case, So Split it out into a helper for preparation. Signed-off-by: Dou Liyang --- Changelog v3 --> v2 - Make the matrix_find_best_cpu() simple and obvious suggested by tglx - Remove the indentation totally suggested by tglx kernel/irq/matrix.c | 65 +++-- 1 file changed, 38 insertions(+), 27 deletions(-) diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c index 5092494bf261..67768bbe736e 100644 --- a/kernel/irq/matrix.c +++ b/kernel/irq/matrix.c @@ -124,6 +124,27 @@ static unsigned int matrix_alloc_area(struct irq_matrix *m, struct cpumap *cm, return area; } +/* Find the best CPU which has the lowest vector allocation count */ +static unsigned int matrix_find_best_cpu(struct irq_matrix *m, + const struct cpumask *msk) +{ + unsigned int cpu, best_cpu, maxavl = 0; + struct cpumap *cm; + + best_cpu = UINT_MAX; + + for_each_cpu(cpu, msk) { + cm = per_cpu_ptr(m->maps, cpu); + + if (!cm->online || cm->available <= maxavl) + continue; + + best_cpu = cpu; + maxavl = cm->available; + } + return best_cpu; +} + /** * irq_matrix_assign_system - Assign system wide entry in the matrix * @m: Matrix pointer @@ -322,37 +343,27 @@ void irq_matrix_remove_reserved(struct irq_matrix *m) int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk, bool reserved, unsigned int *mapped_cpu) { - unsigned int cpu, best_cpu, maxavl = 0; + unsigned int cpu, bit; struct cpumap *cm; - unsigned int bit; - best_cpu = UINT_MAX; - for_each_cpu(cpu, msk) { - cm = per_cpu_ptr(m->maps, cpu); - - if (!cm->online || cm->available <= maxavl) - continue; + cpu = matrix_find_best_cpu(m, msk); + if (cpu == UINT_MAX) + return -ENOSPC; - best_cpu = cpu; - maxavl = cm->available; - } + cm = per_cpu_ptr(m->maps, cpu); + bit = matrix_alloc_area(m, cm, 1, false); + if (bit >= m->alloc_end) + 
return -ENOSPC; + cm->allocated++; + cm->available--; + m->total_allocated++; + m->global_available--; + if (reserved) + m->global_reserved--; + *mapped_cpu = cpu; + trace_irq_matrix_alloc(bit, cpu, m, cm); + return bit; - if (maxavl) { - cm = per_cpu_ptr(m->maps, best_cpu); - bit = matrix_alloc_area(m, cm, 1, false); - if (bit < m->alloc_end) { - cm->allocated++; - cm->available--; - m->total_allocated++; - m->global_available--; - if (reserved) - m->global_reserved--; - *mapped_cpu = best_cpu; - trace_irq_matrix_alloc(bit, best_cpu, m, cm); - return bit; - } - } - return -ENOSPC; } /** -- 2.14.3
[PATCH v2 2/2] irq/matrix: Spread managed interrupts on allocation
From: Dou Liyang Linux has spread out the non managed interrupt across the possible target CPUs to avoid vector space exhaustion. But, the same situation may happen on the managed interrupts. Spread managed interrupt on allocation as well. Fixes: a0c9259dc4e1("irq/matrix: Spread interrupts on allocation") Signed-off-by: Dou Liyang --- Changelog v2 --> v1 -Avoid a churn and the extra indentation suggested by tglx arch/x86/kernel/apic/vector.c | 8 +++- include/linux/irq.h | 3 ++- kernel/irq/matrix.c | 14 +++--- 3 files changed, 16 insertions(+), 9 deletions(-) diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c index 9f148e3d45b4..b7fc290b4b98 100644 --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -313,14 +313,12 @@ assign_managed_vector(struct irq_data *irqd, const struct cpumask *dest) struct apic_chip_data *apicd = apic_chip_data(irqd); int vector, cpu; - cpumask_and(vector_searchmask, vector_searchmask, affmsk); - cpu = cpumask_first(vector_searchmask); - if (cpu >= nr_cpu_ids) - return -EINVAL; + cpumask_and(vector_searchmask, dest, affmsk); + /* set_affinity might call here for nothing */ if (apicd->vector && cpumask_test_cpu(apicd->cpu, vector_searchmask)) return 0; - vector = irq_matrix_alloc_managed(vector_matrix, cpu); + vector = irq_matrix_alloc_managed(vector_matrix, vector_searchmask, &cpu); trace_vector_alloc_managed(irqd->irq, vector, vector); if (vector < 0) return vector; diff --git a/include/linux/irq.h b/include/linux/irq.h index 201de12a9957..c9bffda04a45 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -1151,7 +1151,8 @@ void irq_matrix_offline(struct irq_matrix *m); void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit, bool replace); int irq_matrix_reserve_managed(struct irq_matrix *m, const struct cpumask *msk); void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask *msk); -int irq_matrix_alloc_managed(struct irq_matrix *m, unsigned int cpu); +int 
irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk, + unsigned int *mapped_cpu); void irq_matrix_reserve(struct irq_matrix *m); void irq_matrix_remove_reserved(struct irq_matrix *m); int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk, diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c index 0cbcdef9337a..753b8b0ba045 100644 --- a/kernel/irq/matrix.c +++ b/kernel/irq/matrix.c @@ -259,11 +259,18 @@ void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask *msk) * @m: Matrix pointer * @cpu: On which CPU the interrupt should be allocated */ -int irq_matrix_alloc_managed(struct irq_matrix *m, unsigned int cpu) +int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk, + unsigned int *mapped_cpu) { - struct cpumap *cm = per_cpu_ptr(m->maps, cpu); unsigned int bit, end = m->alloc_end; + unsigned int best_cpu = UINT_MAX; + struct cpumap *cm; + if (!matrix_find_best_cpu(m, msk, &best_cpu)) + return -ENOSPC; + + cm = per_cpu_ptr(m->maps, best_cpu); + end = m->alloc_end; /* Get managed bit which are not allocated */ bitmap_andnot(m->scratch_map, cm->managed_map, cm->alloc_map, end); bit = find_first_bit(m->scratch_map, end); @@ -272,7 +279,8 @@ int irq_matrix_alloc_managed(struct irq_matrix *m, unsigned int cpu) set_bit(bit, cm->alloc_map); cm->allocated++; m->total_allocated++; - trace_irq_matrix_alloc_managed(bit, cpu, m, cm); + *mapped_cpu = best_cpu; + trace_irq_matrix_alloc_managed(bit, best_cpu, m, cm); return bit; } -- 2.14.3
[PATCH v2 1/2] irq/matrix: Split out the CPU finding code into a helper
From: Dou Liyang Linux finds the CPU which has the lowest vector allocation count to spread out the non managed interrupt across the possible target CPUs. This common CPU finding code will also be used in managed case, So Split it out into a helper for preparation. Signed-off-by: Dou Liyang --- Changelog v2 --> v1 -Avoid a churn and the extra indentation suggested by tglx kernel/irq/matrix.c | 60 +++-- 1 file changed, 35 insertions(+), 25 deletions(-) diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c index 5092494bf261..0cbcdef9337a 100644 --- a/kernel/irq/matrix.c +++ b/kernel/irq/matrix.c @@ -124,6 +124,26 @@ static unsigned int matrix_alloc_area(struct irq_matrix *m, struct cpumap *cm, return area; } +/* Find the best CPU which has the lowest vector allocation count */ +static int matrix_find_best_cpu(struct irq_matrix *m, + const struct cpumask *msk, int *best_cpu) +{ + unsigned int cpu, maxavl = 0; + struct cpumap *cm; + + for_each_cpu(cpu, msk) { + cm = per_cpu_ptr(m->maps, cpu); + + if (!cm->online || cm->available <= maxavl) + continue; + + *best_cpu = cpu; + maxavl = cm->available; + } + + return maxavl; +} + /** * irq_matrix_assign_system - Assign system wide entry in the matrix * @m: Matrix pointer @@ -322,35 +342,25 @@ void irq_matrix_remove_reserved(struct irq_matrix *m) int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk, bool reserved, unsigned int *mapped_cpu) { - unsigned int cpu, best_cpu, maxavl = 0; + unsigned int best_cpu = UINT_MAX; struct cpumap *cm; unsigned int bit; - best_cpu = UINT_MAX; - for_each_cpu(cpu, msk) { - cm = per_cpu_ptr(m->maps, cpu); - - if (!cm->online || cm->available <= maxavl) - continue; - - best_cpu = cpu; - maxavl = cm->available; - } + if (!matrix_find_best_cpu(m, msk, &best_cpu)) + return -ENOSPC; - if (maxavl) { - cm = per_cpu_ptr(m->maps, best_cpu); - bit = matrix_alloc_area(m, cm, 1, false); - if (bit < m->alloc_end) { - cm->allocated++; - cm->available--; - m->total_allocated++; - 
m->global_available--; - if (reserved) - m->global_reserved--; - *mapped_cpu = best_cpu; - trace_irq_matrix_alloc(bit, best_cpu, m, cm); - return bit; - } + cm = per_cpu_ptr(m->maps, best_cpu); + bit = matrix_alloc_area(m, cm, 1, false); + if (bit < m->alloc_end) { + cm->allocated++; + cm->available--; + m->total_allocated++; + m->global_available--; + if (reserved) + m->global_reserved--; + *mapped_cpu = best_cpu; + trace_irq_matrix_alloc(bit, best_cpu, m, cm); + return bit; } return -ENOSPC; } -- 2.14.3
[PATCH 2/2] irq/matrix: Spread managed interrupts on allocation
From: Dou Liyang Linux has spread out the non managed interrupt across the possible target CPUs to avoid vector space exhaustion. But, the same situation may happen on the managed interrupts. Spread managed interrupt on allocation as well. Fixes: a0c9259dc4e1("irq/matrix: Spread interrupts on allocation") Signed-off-by: Dou Liyang --- arch/x86/kernel/apic/vector.c | 8 +++- include/linux/irq.h | 3 ++- kernel/irq/matrix.c | 32 3 files changed, 25 insertions(+), 18 deletions(-) diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c index 9f148e3d45b4..b7fc290b4b98 100644 --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -313,14 +313,12 @@ assign_managed_vector(struct irq_data *irqd, const struct cpumask *dest) struct apic_chip_data *apicd = apic_chip_data(irqd); int vector, cpu; - cpumask_and(vector_searchmask, vector_searchmask, affmsk); - cpu = cpumask_first(vector_searchmask); - if (cpu >= nr_cpu_ids) - return -EINVAL; + cpumask_and(vector_searchmask, dest, affmsk); + /* set_affinity might call here for nothing */ if (apicd->vector && cpumask_test_cpu(apicd->cpu, vector_searchmask)) return 0; - vector = irq_matrix_alloc_managed(vector_matrix, cpu); + vector = irq_matrix_alloc_managed(vector_matrix, vector_searchmask, &cpu); trace_vector_alloc_managed(irqd->irq, vector, vector); if (vector < 0) return vector; diff --git a/include/linux/irq.h b/include/linux/irq.h index 201de12a9957..c9bffda04a45 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -1151,7 +1151,8 @@ void irq_matrix_offline(struct irq_matrix *m); void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit, bool replace); int irq_matrix_reserve_managed(struct irq_matrix *m, const struct cpumask *msk); void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask *msk); -int irq_matrix_alloc_managed(struct irq_matrix *m, unsigned int cpu); +int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk, + unsigned 
int *mapped_cpu); void irq_matrix_reserve(struct irq_matrix *m); void irq_matrix_remove_reserved(struct irq_matrix *m); int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk, diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c index 5eb0c8b857f0..b449a749b354 100644 --- a/kernel/irq/matrix.c +++ b/kernel/irq/matrix.c @@ -259,21 +259,29 @@ void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask *msk) * @m: Matrix pointer * @cpu: On which CPU the interrupt should be allocated */ -int irq_matrix_alloc_managed(struct irq_matrix *m, unsigned int cpu) +int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk, + unsigned int *mapped_cpu) { - struct cpumap *cm = per_cpu_ptr(m->maps, cpu); unsigned int bit, end = m->alloc_end; + unsigned int best_cpu = UINT_MAX; + struct cpumap *cm; - /* Get managed bit which are not allocated */ - bitmap_andnot(m->scratch_map, cm->managed_map, cm->alloc_map, end); - bit = find_first_bit(m->scratch_map, end); - if (bit >= end) - return -ENOSPC; - set_bit(bit, cm->alloc_map); - cm->allocated++; - m->total_allocated++; - trace_irq_matrix_alloc_managed(bit, cpu, m, cm); - return bit; + if (matrix_find_best_cpu(m, msk, &best_cpu)) { + cm = per_cpu_ptr(m->maps, best_cpu); + end = m->alloc_end; + /* Get managed bit which are not allocated */ + bitmap_andnot(m->scratch_map, cm->managed_map, cm->alloc_map, end); + bit = find_first_bit(m->scratch_map, end); + if (bit >= end) + return -ENOSPC; + set_bit(bit, cm->alloc_map); + cm->allocated++; + m->total_allocated++; + *mapped_cpu = best_cpu; + trace_irq_matrix_alloc_managed(bit, best_cpu, m, cm); + return bit; + } + return -ENOSPC; } /** -- 2.14.3
[PATCH 1/2] irq/matrix: Split out the CPU finding code into a helper
From: Dou Liyang Linux finds the CPU which has the lowest vector allocation count to spread out the non managed interrupt across the possible target CPUs. This common CPU finding code will also be used in managed case, So Split it out into a helper for preparation. Signed-off-by: Dou Liyang --- kernel/irq/matrix.c | 35 ++- 1 file changed, 22 insertions(+), 13 deletions(-) diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c index 5092494bf261..5eb0c8b857f0 100644 --- a/kernel/irq/matrix.c +++ b/kernel/irq/matrix.c @@ -124,6 +124,26 @@ static unsigned int matrix_alloc_area(struct irq_matrix *m, struct cpumap *cm, return area; } +/* Find the best CPU which has the lowest vector allocation count */ +static int matrix_find_best_cpu(struct irq_matrix *m, + const struct cpumask *msk, int *best_cpu) +{ + unsigned int cpu, maxavl = 0; + struct cpumap *cm; + + for_each_cpu(cpu, msk) { + cm = per_cpu_ptr(m->maps, cpu); + + if (!cm->online || cm->available <= maxavl) + continue; + + *best_cpu = cpu; + maxavl = cm->available; + } + + return maxavl; +} + /** * irq_matrix_assign_system - Assign system wide entry in the matrix * @m: Matrix pointer @@ -322,22 +342,11 @@ void irq_matrix_remove_reserved(struct irq_matrix *m) int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk, bool reserved, unsigned int *mapped_cpu) { - unsigned int cpu, best_cpu, maxavl = 0; + unsigned int best_cpu = UINT_MAX; struct cpumap *cm; unsigned int bit; - best_cpu = UINT_MAX; - for_each_cpu(cpu, msk) { - cm = per_cpu_ptr(m->maps, cpu); - - if (!cm->online || cm->available <= maxavl) - continue; - - best_cpu = cpu; - maxavl = cm->available; - } - - if (maxavl) { + if (matrix_find_best_cpu(m, msk, &best_cpu)) { cm = per_cpu_ptr(m->maps, best_cpu); bit = matrix_alloc_area(m, cm, 1, false); if (bit < m->alloc_end) { -- 2.14.3
[PATCH v3] acpi/processor: Fix the return value of acpi_processor_ids_walk()
The ACPI driver should make sure all the processor IDs in the ACPI namespace are unique. The driver performs a depth-first walk of the namespace tree and calls acpi_processor_ids_walk() to check for duplicate IDs. But acpi_processor_ids_walk() returns the wrong value: once a processor has been checked, it returns true, which causes the walk to break immediately, so the other processors are never checked. Replace the value with AE_OK, which is the standard acpi_status value, and don't abort the namespace walk even on error. Fixes: 8c8cb30f49b8 ("acpi/processor: Implement DEVICE operator for processor enumeration") Signed-off-by: Dou Liyang --- Changelog: v2 --> v3: - Fix a compiler error reported by LKP v1 --> v2: - Fix the check against duplicate IDs suggested by Rafael. So far, duplicate IDs have only been found on the Ivb42 machine, and this check was added in Linux 4.9. But commit 8c8cb30f49b8 introduced a bug in Linux 4.12. To resolve that bug, I first removed the check[1], because Linux compares the incoming ID with the present processors when a physical CPU is hot-added and avoids using duplicate IDs. But it seems we should consider all the possible processors. So, with this patch, processors with duplicate IDs will never be hot-plugged. 
[1] https://lkml.org/lkml/2018/5/28/213 --- drivers/acpi/acpi_processor.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c index 449d86d39965..fc447410ae4d 100644 --- a/drivers/acpi/acpi_processor.c +++ b/drivers/acpi/acpi_processor.c @@ -643,7 +643,7 @@ static acpi_status __init acpi_processor_ids_walk(acpi_handle handle, status = acpi_get_type(handle, &acpi_type); if (ACPI_FAILURE(status)) - return false; + return status; switch (acpi_type) { case ACPI_TYPE_PROCESSOR: @@ -663,11 +663,12 @@ static acpi_status __init acpi_processor_ids_walk(acpi_handle handle, } processor_validated_ids_update(uid); - return true; + return AE_OK; err: + /* Exit on error, but don't abort the namespace walk */ acpi_handle_info(handle, "Invalid processor object\n"); - return false; + return AE_OK; } -- 2.14.3
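The bug fixed above hinges on how acpica's namespace walk interprets callback return values: any nonzero status aborts the walk. A minimal userspace model, with made-up node layout and the duplicate check reduced to a visit counter, shows why `return true` (which is 1, a nonzero status) stopped the walk after the first processor while `AE_OK` (0) lets it continue:

```c
typedef int acpi_status;
#define AE_OK 0

static int visited; /* how many processor objects the walk examined */

/* Fixed callback: always AE_OK, so every processor gets checked. */
static acpi_status ids_walk_ok(int uid)
{
    (void)uid;
    visited++;
    return AE_OK;
}

/* Buggy callback: 'return true' reads as a nonzero acpi_status. */
static acpi_status ids_walk_buggy(int uid)
{
    (void)uid;
    visited++;
    return 1; /* true */
}

/* Simplified stand-in for acpi_walk_namespace(): a nonzero status
 * from the callback aborts the walk immediately. */
static acpi_status walk_namespace(acpi_status (*cb)(int),
                                  const int *uids, int n)
{
    for (int i = 0; i < n; i++) {
        acpi_status s = cb(uids[i]);
        if (s != AE_OK)
            return s; /* abort, like the real walker does */
    }
    return AE_OK;
}
```

The buggy callback visits exactly one node out of three; the fixed one visits all of them, which is why the patch also returns `AE_OK` from the error path so a single invalid processor object doesn't hide the rest.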
[RESEND PATCH v2] acpi/processor: Fix the return value of acpi_processor_ids_walk()
The ACPI driver should make sure all the processor IDs in the ACPI namespace are unique. The driver performs a depth-first walk of the namespace tree and calls acpi_processor_ids_walk() to check for duplicate IDs. But acpi_processor_ids_walk() returns the wrong value: once a processor has been checked, it returns true, which causes the walk to break immediately, so the other processors are never checked. Replace the value with AE_OK, which is the standard acpi_status value, and don't abort the namespace walk even on error. Fixes: 8c8cb30f49b8 ("acpi/processor: Implement DEVICE operator for processor enumeration") Signed-off-by: Dou Liyang --- Changelog: v1 --> v2: - Fix the check against duplicate IDs suggested by Rafael. So far, duplicate IDs have only been found on the Ivb42 machine, and this check was added in Linux 4.9. But commit 8c8cb30f49b8 introduced a bug in Linux 4.12. To resolve that bug, I first removed the check[1], because Linux compares the incoming ID with the present processors when a physical CPU is hot-added and avoids using duplicate IDs. But it seems we should consider all the possible processors. So, with this patch, processors with duplicate IDs will never be hot-plugged. 
[1] https://lkml.org/lkml/2018/5/28/213 --- drivers/acpi/acpi_processor.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c index 449d86d39965..a59870ccd5ca 100644 --- a/drivers/acpi/acpi_processor.c +++ b/drivers/acpi/acpi_processor.c @@ -643,7 +643,7 @@ static acpi_status __init acpi_processor_ids_walk(acpi_handle handle, status = acpi_get_type(handle, &acpi_type); if (ACPI_FAILURE(status)) - return false; + return_ACPI_STATUS(status); switch (acpi_type) { case ACPI_TYPE_PROCESSOR: @@ -663,11 +663,12 @@ static acpi_status __init acpi_processor_ids_walk(acpi_handle handle, } processor_validated_ids_update(uid); - return true; + return AE_OK; err: + /* Exit on error, but don't abort the namespace walk */ acpi_handle_info(handle, "Invalid processor object\n"); - return false; + return AE_OK; } -- 2.14.3
[PATCH] sched/core: Fix compile warning in SMP=n case
When compiling the kernel with SMP disabled, the build warns with: kernel/sched/core.c: In function ‘update_rq_clock_task’: kernel/sched/core.c:139:17: warning: unused variable ‘irq_delta’ [-Wunused-variable] s64 steal = 0, irq_delta = 0; Fix this by reverting the HAVE_SCHED_AVG_IRQ guard to defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING) Fixes: 2e62c4743adc ("sched/fair: Remove #ifdefs from scale_rt_capacity()") Signed-off-by: Dou Liyang --- kernel/sched/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c45de46fdf10..ef954d96c80c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -177,7 +177,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) rq->clock_task += delta; -#ifdef HAVE_SCHED_AVG_IRQ +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING) if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY)) update_irq_load_avg(rq, irq_delta + steal); #endif -- 2.14.3
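The pattern behind the warning and its fix can be shown in a tiny userspace sketch. This is a hypothetical stand-in for `update_rq_clock_task()` (the real function takes `struct rq *` and `s64`; the numbers here are made up): the variable used only under a config guard is declared inside that same guard, so no `-Wunused-variable` fires when both options are off.

```c
/* Pretend this accounting option is enabled, as on a typical config. */
#define CONFIG_IRQ_TIME_ACCOUNTING 1

static long update_rq_clock_task(long delta)
{
    long steal = 0;
#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
    /* Declared and consumed under the same guard that the patch
     * restores, so the variable never exists unused. */
    long irq_delta = 2; /* time eaten by interrupt processing */
    delta -= irq_delta;
#endif
    delta -= steal;
    return delta; /* what is actually charged to the task clock */
}
```

With the guard keyed to `HAVE_SCHED_AVG_IRQ` (only defined under SMP), an SMP=n build compiled the declaration but not the use, producing the warning the patch title describes.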
Re: [PATCH v4 2/4] x86/boot: Add acpitb.c to parse acpi tables
At 07/23/2018 05:29 PM, Chao Fan wrote: Imitate the ACPI code to parse ACPI tables. Functions are simplified cause some operations are not needed here. And also, this method won't influence the initialization of ACPI. Signed-off-by: Chao Fan Hi Fan, I know you got the code from the ACPICA subsystem and the EFI code... and did a lot of adaptation work for KASLR. It's awesome! I think you could add some additional simple comments: - What are the differences between your function and the function it is based on, and why did you make them? ... to make this more credible and easier to remember the details as time goes on. Also some concerns below. --- [...] + else if (!strncmp(sig, EFI32_LOADER_SIGNATURE, 4)) + efi_64 = false; + else { + debug_putstr("Wrong efi loader signature.\n"); s/efi/EFI/; this also needs fixing in the comments below. + return false; + } + [...] + /* +* Get rsdp from efi tables. +* If we find acpi table, go on searching for acpi20 table. +* If we didn't get acpi20 table then use acpi table. +* If neither acpi table nor acpi20 table found, +* return false. +*/ + if (!(efi_guidcmp(guid, ACPI_TABLE_GUID)) && !acpi_20) { + *rsdp_addr = (acpi_physical_address)table; + acpi_20 = false; + find_rsdp = true; + } else if (!(efi_guidcmp(guid, ACPI_20_TABLE_GUID))) { + *rsdp_addr = (acpi_physical_address)table; + acpi_20 = true; + return true; If we find the ACPI 2.0 table, we return immediately, so the _acpi_20_ variable and its logic are redundant. Thanks, dou
Re: [PATCH v4 3/4] x86/boot/KASLR: Walk srat tables to filter immovable memory
At 08/02/2018 03:05 PM, Thomas Gleixner wrote: [...] Folks. Can you please both stop this annoying habit of keeping the full context of the mail and then sprinkling a random sentence into the middle? I see. I won’t do this stupid thing again. Thanks, dou
Re: [PATCH v4 4/4] x86/boot/KASLR: Limit kaslr to choosing the immovable memory
At 08/02/2018 02:00 PM, Chao Fan wrote: On Thu, Aug 02, 2018 at 01:46:29PM +0800, Dou Liyang wrote: Hi Fan, At 07/23/2018 05:29 PM, Chao Fan wrote: If 'CONFIG_MEMORY_HOTREMOVE' specified and the account of immovable memory regions is not zero. Calculate the intersection between memory regions from e820/efi memory table and immovable memory regions. Or go on the old code. Rename process_mem_region to slots_count to match slots_fetch_random, and name new function as process_mem_region. Signed-off-by: Chao Fan --- arch/x86/boot/compressed/kaslr.c | 66 ++-- 1 file changed, 55 insertions(+), 11 deletions(-) diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c index 4705682caf1f..10bda3a1fcaa 100644 --- a/arch/x86/boot/compressed/kaslr.c +++ b/arch/x86/boot/compressed/kaslr.c @@ -631,9 +631,9 @@ static unsigned long slots_fetch_random(void) return 0; } -static void process_mem_region(struct mem_vector *entry, - unsigned long minimum, - unsigned long image_size) +static void slots_count(struct mem_vector *entry, ^^^ is not suitable. IMO, how about process_mem_slots() or you may have a better name, it's up to you. It's from Kees Cook's advice; he wanted to use slots_count() to match slots_fetch_random() in my old patch from long ago. Since the method of handling this part has not changed much, I kept this name. Okay! ;-) dou
Re: [PATCH v4 4/4] x86/boot/KASLR: Limit kaslr to choosing the immovable memory
Hi Fan, At 07/23/2018 05:29 PM, Chao Fan wrote: If 'CONFIG_MEMORY_HOTREMOVE' specified and the account of immovable memory regions is not zero. Calculate the intersection between memory regions from e820/efi memory table and immovable memory regions. Or go on the old code. Rename process_mem_region to slots_count to match slots_fetch_random, and name new function as process_mem_region. Signed-off-by: Chao Fan --- arch/x86/boot/compressed/kaslr.c | 66 ++-- 1 file changed, 55 insertions(+), 11 deletions(-) diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c index 4705682caf1f..10bda3a1fcaa 100644 --- a/arch/x86/boot/compressed/kaslr.c +++ b/arch/x86/boot/compressed/kaslr.c @@ -631,9 +631,9 @@ static unsigned long slots_fetch_random(void) return 0; } -static void process_mem_region(struct mem_vector *entry, - unsigned long minimum, - unsigned long image_size) +static void slots_count(struct mem_vector *entry, ^^^ is not suitable. IMO, how about process_mem_slots() or you may have a better name, it's up to you. + unsigned long minimum, + unsigned long image_size) { struct mem_vector region, overlap; struct slot_area slot_area; slot_area is also unused. Thanks, dou
Re: [PATCH v4 3/4] x86/boot/KASLR: Walk srat tables to filter immovable memory
Hi Fan, At 07/23/2018 05:29 PM, Chao Fan wrote: If 'CONFIG_MEMORY_HOTREMOVE' specified, walk the acpi srat memory tables, store the immovable memory regions, so that kaslr can get the information abouth where can be selected or not. If 'CONFIG_MEMORY_HOTREMOVE' not specified, go on the old code. Signed-off-by: Chao Fan --- arch/x86/boot/compressed/kaslr.c | 55 1 file changed, 55 insertions(+) diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c index 531c9876f573..4705682caf1f 100644 --- a/arch/x86/boot/compressed/kaslr.c +++ b/arch/x86/boot/compressed/kaslr.c @@ -31,6 +31,7 @@ #include "misc.h" #include "error.h" +#include "acpitb.h" #include "../string.h" #include @@ -104,6 +105,14 @@ static bool memmap_too_large; /* Store memory limit specified by "mem=nn[KMG]" or "memmap=nn[KMG]" */ unsigned long long mem_limit = ULLONG_MAX; +#ifdef CONFIG_MEMORY_HOTREMOVE +/* Store the immovable memory regions */ +struct mem_vector immovable_mem[]; + +/* Store the amount of immovable memory regions */ +static int num_immovable_mem; +#endif + enum mem_avoid_index { MEM_AVOID_ZO_RANGE = 0, @@ -298,6 +307,47 @@ static int handle_mem_options(void) return 0; } +#ifdef CONFIG_MEMORY_HOTREMOVE +/* + * According to ACPI table, filter the immvoable memory regions + * and store them in immovable_mem[]. + */ +static void handle_immovable_mem(void) +{ + struct acpi_table_header *table_header; + struct acpi_subtable_header *table; + struct acpi_srat_mem_affinity *ma; + unsigned long table_size; + unsigned long table_end; + int i = 0; + + table_header = get_acpi_srat_table(); + + if (!table_header) + return; + + table_end = (unsigned long)table_header + table_header->length; + + table = (struct acpi_subtable_header *) + ((unsigned long)table_header + sizeof(struct acpi_table_srat)); + + table_size = sizeof(struct acpi_subtable_header); table_size isn't used, can be remove. 
+ while (((unsigned long)table) + table->length < table_end) { + if (table->type == 1) { + ma = (struct acpi_srat_mem_affinity *)table; + if (!(ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE)) { + immovable_mem[i].start = ma->base_address; + immovable_mem[i].size = ma->length; + i++; need a check(i < MAX_NUMNODES*2) before doing that in case of __IndexOutOfBoundsException__ even if it may be impossible in ACPI. Thanks, dou + } + } + table = (struct acpi_subtable_header *) + ((unsigned long)table + table->length); + } + num_immovable_mem = i; +} +#endif + /* * In theory, KASLR can put the kernel anywhere in the range of [16M, 64T). * The mem_avoid array is used to store the ranges that need to be avoided @@ -421,6 +471,11 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size, /* Mark the memmap regions we need to avoid */ handle_mem_options(); +#ifdef CONFIG_MEMORY_HOTREMOVE + /* Mark the immovable regions we need to choose */ + handle_immovable_mem(); +#endif + #ifdef CONFIG_X86_VERBOSE_BOOTUP /* Make sure video RAM can be used. */ add_identity_map(0, PMD_SIZE);
[tip:x86/platform] x86/platform/UV: Mark memblock related init code and data correctly
Commit-ID: 24cfd8ca1d28331b9dad3b88d1958c976b2cfab6 Gitweb: https://git.kernel.org/tip/24cfd8ca1d28331b9dad3b88d1958c976b2cfab6 Author: Dou Liyang AuthorDate: Mon, 30 Jul 2018 15:59:47 +0800 Committer: Thomas Gleixner CommitDate: Mon, 30 Jul 2018 19:53:58 +0200 x86/platform/UV: Mark memblock related init code and data correctly parse_mem_block_size() and mem_block_size are only used during init. Mark them accordingly. Fixes: d7609f4210cb ("x86/platform/UV: Add kernel parameter to set memory block size") Signed-off-by: Dou Liyang Signed-off-by: Thomas Gleixner Cc: h...@zytor.com Cc: Mike Travis Cc: Andrew Banman Link: https://lkml.kernel.org/r/20180730075947.23023-1-douly.f...@cn.fujitsu.com --- arch/x86/kernel/apic/x2apic_uv_x.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/apic/x2apic_uv_x.c b/arch/x86/kernel/apic/x2apic_uv_x.c index d492752f79e1..391f358ebb4c 100644 --- a/arch/x86/kernel/apic/x2apic_uv_x.c +++ b/arch/x86/kernel/apic/x2apic_uv_x.c @@ -394,10 +394,10 @@ extern int uv_hub_info_version(void) EXPORT_SYMBOL(uv_hub_info_version); /* Default UV memory block size is 2GB */ -static unsigned long mem_block_size = (2UL << 30); +static unsigned long mem_block_size __initdata = (2UL << 30); /* Kernel parameter to specify UV mem block size */ -static int parse_mem_block_size(char *ptr) +static int __init parse_mem_block_size(char *ptr) { unsigned long size = memparse(ptr, NULL);
[tip:x86/timers] x86/tsc: Consolidate init code
Commit-ID: 608008a45798fe9e2aee04f99b5270ea57c1376f Gitweb: https://git.kernel.org/tip/608008a45798fe9e2aee04f99b5270ea57c1376f Author: Dou Liyang AuthorDate: Mon, 30 Jul 2018 15:54:20 +0800 Committer: Thomas Gleixner CommitDate: Mon, 30 Jul 2018 19:33:35 +0200 x86/tsc: Consolidate init code Split out suplicated code from tsc_early_init() and tsc_init() into a common helper and fixup some comment typos. [ tglx: Massaged changelog and renamed function ] Signed-off-by: Dou Liyang Signed-off-by: Thomas Gleixner Reviewed-by: Pavel Tatashin Cc: Cc: Peter Zijlstra Cc: "H. Peter Anvin" Link: https://lkml.kernel.org/r/20180730075421.22830-2-douly.f...@cn.fujitsu.com --- arch/x86/kernel/tsc.c | 22 -- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index 02e416b87ac1..1463468ba9a0 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -182,7 +182,7 @@ static void __init cyc2ns_init_boot_cpu(void) } /* - * Secondary CPUs do not run through cyc2ns_init(), so set up + * Secondary CPUs do not run through tsc_init(), so set up * all the scale factors for all CPUs, assuming the same * speed as the bootup CPU. (cpufreq notifiers will fix this * up if their speed diverges) @@ -1389,7 +1389,7 @@ static bool __init determine_cpu_tsc_frequencies(bool early) } /* -* Trust non-zero tsc_khz as authorative, +* Trust non-zero tsc_khz as authoritative, * and use it to sanity check cpu_khz, * which will be off if system timer is off. 
*/ @@ -1421,6 +1421,14 @@ static unsigned long __init get_loops_per_jiffy(void) return lpj; } +static void __init tsc_enable_sched_clock(void) +{ + /* Sanitize TSC ADJUST before cyc2ns gets initialized */ + tsc_store_and_check_tsc_adjust(true); + cyc2ns_init_boot_cpu(); + static_branch_enable(&__use_tsc); +} + void __init tsc_early_init(void) { if (!boot_cpu_has(X86_FEATURE_TSC)) @@ -1429,10 +1437,7 @@ void __init tsc_early_init(void) return; loops_per_jiffy = get_loops_per_jiffy(); - /* Sanitize TSC ADJUST before cyc2ns gets initialized */ - tsc_store_and_check_tsc_adjust(true); - cyc2ns_init_boot_cpu(); - static_branch_enable(&__use_tsc); + tsc_enable_sched_clock(); } void __init tsc_init(void) @@ -1456,13 +1461,10 @@ void __init tsc_init(void) setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER); return; } - /* Sanitize TSC ADJUST before cyc2ns gets initialized */ - tsc_store_and_check_tsc_adjust(true); - cyc2ns_init_boot_cpu(); + tsc_enable_sched_clock(); } cyc2ns_init_secondary_cpus(); - static_branch_enable(&__use_tsc); if (!no_sched_irq_time) enable_sched_clock_irqtime();
[tip:x86/timers] x86/kvmclock: Mark kvm_get_preset_lpj() as __init
Commit-ID: 1088c6eef261939bda8346ec35b513790a2111d5 Gitweb: https://git.kernel.org/tip/1088c6eef261939bda8346ec35b513790a2111d5 Author: Dou Liyang AuthorDate: Mon, 30 Jul 2018 15:54:21 +0800 Committer: Thomas Gleixner CommitDate: Mon, 30 Jul 2018 19:33:35 +0200 x86/kvmclock: Mark kvm_get_preset_lpj() as __init kvm_get_preset_lpj() is only called from kvmclock_init(), so mark it __init as well. Signed-off-by: Dou Liyang Signed-off-by: Thomas Gleixner Reviewed-by: Pavel Tatashin Cc: Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Cc: "H. Peter Anvin" Cc: "Radim Krčmář" Cc: k...@vger.kernel.org Link: https://lkml.kernel.org/r/20180730075421.22830-3-douly.f...@cn.fujitsu.com --- arch/x86/kernel/kvmclock.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index 91b94c0ae4e3..d2edd7e6c294 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -145,7 +145,7 @@ static unsigned long kvm_get_tsc_khz(void) return pvclock_tsc_khz(this_cpu_pvti()); } -static void kvm_get_preset_lpj(void) +static void __init kvm_get_preset_lpj(void) { unsigned long khz; u64 lpj;
Re: [PATCH 1/2] x86/tsc: Avoid a double call if TSC initializes earlier.
Hi Peter, At 07/30/2018 05:34 PM, Peter Zijlstra wrote: On Mon, Jul 30, 2018 at 03:54:20PM +0800, Dou Liyang wrote: static_branch_enable(&__use_tsc) may be called twice in the common case, which is redundant. It is also really not a problem... Yes, right. This just avoids the second, useless setting. Thanks, dou
[PATCH] x86/apic: Mark parse_mem_block_size as __init
Functions registered with early_param() are only called during kernel initialization, so Linux marks them with the __init macro to save memory. But parse_mem_block_size() was not marked, so make it __init as well. Also mark the mem_block_size variable __initdata. Signed-off-by: Dou Liyang --- arch/x86/kernel/apic/x2apic_uv_x.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/apic/x2apic_uv_x.c b/arch/x86/kernel/apic/x2apic_uv_x.c index d492752f79e1..391f358ebb4c 100644 --- a/arch/x86/kernel/apic/x2apic_uv_x.c +++ b/arch/x86/kernel/apic/x2apic_uv_x.c @@ -394,10 +394,10 @@ extern int uv_hub_info_version(void) EXPORT_SYMBOL(uv_hub_info_version); /* Default UV memory block size is 2GB */ -static unsigned long mem_block_size = (2UL << 30); +static unsigned long mem_block_size __initdata = (2UL << 30); /* Kernel parameter to specify UV mem block size */ -static int parse_mem_block_size(char *ptr) +static int __init parse_mem_block_size(char *ptr) { unsigned long size = memparse(ptr, NULL); -- 2.14.3
[PATCH 1/2] x86/tsc: Avoid a double call if TSC initializes earlier.
static_branch_enable(&__use_tsc) may be called twice in the common case, which is redundant. There is also some common code in both tsc_early_init() and tsc_init(). Move it into a separate helper function that is called only once. Also fix comments: -s/authorative/authoritative -s/cyc2ns_init/tsc_init Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Peter Zijlstra Cc: Pavel Tatashin Signed-off-by: Dou Liyang --- arch/x86/kernel/tsc.c | 22 -- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index 02e416b87ac1..ba0a6b6e0726 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -182,7 +182,7 @@ static void __init cyc2ns_init_boot_cpu(void) } /* - * Secondary CPUs do not run through cyc2ns_init(), so set up + * Secondary CPUs do not run through tsc_init(), so set up * all the scale factors for all CPUs, assuming the same * speed as the bootup CPU. (cpufreq notifiers will fix this * up if their speed diverges) @@ -1389,7 +1389,7 @@ static bool __init determine_cpu_tsc_frequencies(bool early) } /* -* Trust non-zero tsc_khz as authorative, +* Trust non-zero tsc_khz as authoritative, * and use it to sanity check cpu_khz, * which will be off if system timer is off.
*/ @@ -1421,6 +1421,14 @@ static unsigned long __init get_loops_per_jiffy(void) return lpj; } +static void __init enable_use_tsc(void) +{ + /* Sanitize TSC ADJUST before cyc2ns gets initialized */ + tsc_store_and_check_tsc_adjust(true); + cyc2ns_init_boot_cpu(); + static_branch_enable(&__use_tsc); +} + void __init tsc_early_init(void) { if (!boot_cpu_has(X86_FEATURE_TSC)) @@ -1429,10 +1437,7 @@ void __init tsc_early_init(void) return; loops_per_jiffy = get_loops_per_jiffy(); - /* Sanitize TSC ADJUST before cyc2ns gets initialized */ - tsc_store_and_check_tsc_adjust(true); - cyc2ns_init_boot_cpu(); - static_branch_enable(&__use_tsc); + enable_use_tsc(); } void __init tsc_init(void) @@ -1456,13 +1461,10 @@ void __init tsc_init(void) setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER); return; } - /* Sanitize TSC ADJUST before cyc2ns gets initialized */ - tsc_store_and_check_tsc_adjust(true); - cyc2ns_init_boot_cpu(); + enable_use_tsc(); } cyc2ns_init_secondary_cpus(); - static_branch_enable(&__use_tsc); if (!no_sched_irq_time) enable_sched_clock_irqtime(); -- 2.14.3
[PATCH 2/2] x86/kvmclock: Mark kvm_get_preset_lpj() as __init
kvm_get_preset_lpj() is only called from kvmclock_init(), so mark it __init as well. Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Paolo Bonzini Cc: "Radim Krčmář" Cc: k...@vger.kernel.org Signed-off-by: Dou Liyang --- arch/x86/kernel/kvmclock.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index 91b94c0ae4e3..d2edd7e6c294 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -145,7 +145,7 @@ static unsigned long kvm_get_tsc_khz(void) return pvclock_tsc_khz(this_cpu_pvti()); } -static void kvm_get_preset_lpj(void) +static void __init kvm_get_preset_lpj(void) { unsigned long khz; u64 lpj; -- 2.14.3
[PATCH 0/2] Two cleanup patches
Here are two cleanup patches I made while testing the early boot time stamps with the tip tree today. Dou Liyang (2): x86/tsc: Avoid a double call if tsc can initialize earlier. x86/kvmclock: Mark kvm_get_preset_lpj() as __init arch/x86/kernel/kvmclock.c | 2 +- arch/x86/kernel/tsc.c | 22 -- 2 files changed, 13 insertions(+), 11 deletions(-) -- 2.14.3
Re: [PATCH v14 20/25] x86/tsc: calibrate tsc only once
Hi Thomas, At 07/19/2018 02:25 PM, Thomas Gleixner wrote: On Thu, 19 Jul 2018, Dou Liyang wrote: At 07/18/2018 10:22 AM, Pavel Tatashin wrote: + (unsigned long)cpu_khz % KHZ); if (cpu_khz != tsc_khz) { pr_info("Detected %lu.%03lu MHz TSC", - (unsigned long)tsc_khz / 1000, - (unsigned long)tsc_khz % 1000); + (unsigned long)tsc_khz / KHZ, + (unsigned long)tsc_khz % KHZ); + } this curly brackets can be removed No. They want to stay, really. https://lkml.kernel.org/r/alpine.DEB.2.20.1701171956290.3645@nanos The pr_info() is a multiline statement due to the line breaks. I see, I’ll keep that in mind. Thanks, dou
Re: [PATCH v14 20/25] x86/tsc: calibrate tsc only once
Hi, Pavel I am sorry, I didn't point out typo clearly in the previous version. Please see the concerns below. ;-) At 07/18/2018 10:22 AM, Pavel Tatashin wrote: During boot tsc is calibrated twice: once in tsc_early_delay_calibrate(), and the second time in tsc_init(). Rename tsc_early_delay_calibrate() to tsc_early_init(), and rework it so the calibration is done only early, and make tsc_init() to use the values already determined in tsc_early_init(). Sometimes it is not possible to determine tsc early, as the subsystem that is required is not yet initialized, in such case try again later in tsc_init(). Suggested-by: Thomas Gleixner Signed-off-by: Pavel Tatashin --- arch/x86/include/asm/tsc.h | 2 +- arch/x86/kernel/setup.c| 2 +- arch/x86/kernel/tsc.c | 86 -- 3 files changed, 48 insertions(+), 42 deletions(-) diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h index 2701d221583a..c4368ff73652 100644 --- a/arch/x86/include/asm/tsc.h +++ b/arch/x86/include/asm/tsc.h @@ -33,7 +33,7 @@ static inline cycles_t get_cycles(void) extern struct system_counterval_t convert_art_to_tsc(u64 art); extern struct system_counterval_t convert_art_ns_to_tsc(u64 art_ns); -extern void tsc_early_delay_calibrate(void); +extern void tsc_early_init(void); extern void tsc_init(void); extern void mark_tsc_unstable(char *reason); extern int unsynchronized_tsc(void); diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 7490de925a81..5d32c55aeb8b 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -1014,6 +1014,7 @@ void __init setup_arch(char **cmdline_p) */ init_hypervisor_platform(); + tsc_early_init(); x86_init.resources.probe_roms(); /* after parse_early_param, so could debug it */ @@ -1199,7 +1200,6 @@ void __init setup_arch(char **cmdline_p) memblock_find_dma_reserve(); - tsc_early_delay_calibrate(); if (!early_xdbc_setup_hardware()) early_xdbc_register_console(); diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index 
186395041725..bc8eb82050a3 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -33,6 +33,8 @@ EXPORT_SYMBOL(cpu_khz); unsigned int __read_mostly tsc_khz; EXPORT_SYMBOL(tsc_khz); +#define KHZ 1000 + /* * TSC can be unstable due to cpufreq or due to unsynced TSCs */ @@ -1335,34 +1337,10 @@ static int __init init_tsc_clocksource(void) */ device_initcall(init_tsc_clocksource); -void __init tsc_early_delay_calibrate(void) -{ - unsigned long lpj; - - if (!boot_cpu_has(X86_FEATURE_TSC)) - return; - - cpu_khz = x86_platform.calibrate_cpu(); - tsc_khz = x86_platform.calibrate_tsc(); - - tsc_khz = tsc_khz ? : cpu_khz; - if (!tsc_khz) - return; - - lpj = tsc_khz * 1000; - do_div(lpj, HZ); - loops_per_jiffy = lpj; -} - -void __init tsc_init(void) +static bool determine_cpu_tsc_frequncies(void) this function need to be mark as __init. And a typo here: frequency, s/frequncies/frequencies { - u64 lpj, cyc; - int cpu; - - if (!boot_cpu_has(X86_FEATURE_TSC)) { - setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER); - return; - } + /* Make sure that cpu and tsc are not already calibrated */ + WARN_ON(cpu_khz || tsc_khz); cpu_khz = x86_platform.calibrate_cpu(); tsc_khz = x86_platform.calibrate_tsc(); @@ -1377,20 +1355,51 @@ void __init tsc_init(void) else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz) cpu_khz = tsc_khz; - if (!tsc_khz) { - mark_tsc_unstable("could not calculate TSC khz"); - setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER); - return; - } + if (tsc_khz == 0) + return false; pr_info("Detected %lu.%03lu MHz processor\n", - (unsigned long)cpu_khz / 1000, - (unsigned long)cpu_khz % 1000); + (unsigned long)cpu_khz / KHZ, + (unsigned long)cpu_khz % KHZ); if (cpu_khz != tsc_khz) { pr_info("Detected %lu.%03lu MHz TSC", - (unsigned long)tsc_khz / 1000, - (unsigned long)tsc_khz % 1000); + (unsigned long)tsc_khz / KHZ, + (unsigned long)tsc_khz % KHZ); + } this curly brackets can be removed + return true; +} + +static unsigned long get_loops_per_jiffy(void) mark as 
__init as well. +{ + unsigned long lpj = tsc_khz * KHZ; + + do_div(lpj, HZ); + return lpj; +} + +void __init tsc_early_init(void) +{ + if (!boot_cpu_has(X86_FEATURE_TSC)) + return; + if (!determine_cpu_tsc_frequncies()) + return; + loops_per_jiffy = get_loops_per_jiffy(); +} + +void __init tsc_init(void) +{ +
Re: [PATCH v13 13/18] x86/tsc: calibrate tsc only once
At 07/16/2018 09:35 PM, Pavel Tatashin wrote: [...] v13 has xen patches, so xen timestamps work early as well now. Oops, yes, my mistake. I will test the patchset with Thomas's kvm patch for you. Thanks, dou Thank you, Pavel
Re: [PATCH v13 13/18] x86/tsc: calibrate tsc only once
At 07/13/2018 07:30 PM, Pavel Tatashin wrote: On Fri, Jul 13, 2018 at 3:24 AM Dou Liyang wrote: At 07/12/2018 08:04 AM, Pavel Tatashin wrote: During boot tsc is calibrated twice: once in tsc_early_delay_calibrate(), and the second time in tsc_init(). Rename tsc_early_delay_calibrate() to tsc_early_init(), and rework it so the calibration is done only early, and make tsc_init() to use the values already determined in tsc_early_init(). Sometimes it is not possible to determine tsc early, as the subsystem that is required is not yet initialized, in such case try again later in tsc_init(). Suggested-by: Thomas Gleixner Signed-off-by: Pavel Tatashin Hi Pavel, Aha, a complex solution for a simple problem! ;-) And I didn't find any benefits of doing that. Did I miss something? Hi Dou, I had this in previous version: init early, and unconditionally re-init later (which required to resync sched clocks for continuity, and check for some other corner cases). Thomas did not like the idea, as it is less deterministic: it can lead to code that works by accident, where we might get one tsc frequency early and another later, and so on. The request was to initialize only once, and if that fails do it again later. This way, if early initialization is broken, we will know and fix it. Hi Pavel, Yes, right, I saw the purpose in v12. Since cpu_khz and tsc_khz are global variables and tsc_khz may be reset to cpu_khz, how about the following patch? Could you please explain where you think this patch can be applied, and what it fixes? This patch is just a simple alternative that achieves what you want in your patch. Your patch is also good but complex, and needs some scrubbing, e.g.: - Is it suitable to use WARN_ON()? - The name of the determine_cpu_tsc_frequncies() function: s/frequncies/frequencies/. BTW, how about tsc_calibrate_frequencies()? - ...
BTW, before this patch, it seems we need to make sure Xen works well. I can investigate and try to test whether we can also move pagetable_init() ahead of tsc_early_init() for you. Thanks, dou diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index da1dbd99cb6e..74cb16d89e25 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -1197,12 +1197,12 @@ void __init setup_arch(char **cmdline_p) memblock_find_dma_reserve(); + x86_init.paging.pagetable_init(); + tsc_early_delay_calibrate(); if (!early_xdbc_setup_hardware()) early_xdbc_register_console(); - x86_init.paging.pagetable_init(); - kasan_init(); /* Thank you, Pavel Thanks, dou 8<--- diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index 74392d9d51e0..e54fa1037d45 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -1370,8 +1370,10 @@ void __init tsc_init(void) return; } - cpu_khz = x86_platform.calibrate_cpu(); - tsc_khz = x86_platform.calibrate_tsc(); + if (!tsc_khz) { + cpu_khz = x86_platform.calibrate_cpu(); + tsc_khz = x86_platform.calibrate_tsc(); + } /* * Trust non-zero tsc_khz as authorative,
Re: [PATCH v13 14/18] x86/tsc: initialize cyc2ns when tsc freq. is determined
At 07/12/2018 08:04 AM, Pavel Tatashin wrote: cyc2ns converts tsc to nanoseconds, and it is handled in a per-cpu data structure. Currently, the setup code for c2ns data for every possible CPU goes through the same sequence of calculations as for the boot CPU, but is based on the same tsc frequency as the boot CPU, and thus this is not necessary. Initialize the boot cpu when tsc frequency is determined. Copy the calculated data from the boot CPU to the other CPUs in tsc_init(). In addition do the following: - Remove unnecessary zeroing of c2ns data by removing cyc2ns_data_init() - Split set_cyc2ns_scale() into two functions, so set_cyc2ns_scale() can be called when system is up, and wraps around __set_cyc2ns_scale() that can be called directly when system is booting but avoids saving restoring IRQs and going and waking up from idle. Suggested-by: Thomas Gleixner Signed-off-by: Pavel Tatashin --- arch/x86/kernel/tsc.c | 94 --- 1 file changed, 53 insertions(+), 41 deletions(-) diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index bc8eb82050a3..0b1abe7fdd8e 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -103,23 +103,6 @@ void cyc2ns_read_end(void) * -johns...@us.ibm.com "math is hard, lets go shopping!" 
*/ -static void cyc2ns_data_init(struct cyc2ns_data *data) -{ - data->cyc2ns_mul = 0; - data->cyc2ns_shift = 0; - data->cyc2ns_offset = 0; -} - -static void __init cyc2ns_init(int cpu) -{ - struct cyc2ns *c2n = &per_cpu(cyc2ns, cpu); - - cyc2ns_data_init(&c2n->data[0]); - cyc2ns_data_init(&c2n->data[1]); - - seqcount_init(&c2n->seq); -} - static inline unsigned long long cycles_2_ns(unsigned long long cyc) { struct cyc2ns_data data; @@ -135,18 +118,11 @@ static inline unsigned long long cycles_2_ns(unsigned long long cyc) return ns; } -static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now) +static void __set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now) { unsigned long long ns_now; struct cyc2ns_data data; struct cyc2ns *c2n; - unsigned long flags; - - local_irq_save(flags); - sched_clock_idle_sleep_event(); - - if (!khz) - goto done; ns_now = cycles_2_ns(tsc_now); @@ -178,12 +154,55 @@ static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_ c2n->data[0] = data; raw_write_seqcount_latch(&c2n->seq); c2n->data[1] = data; +} + +static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now) +{ + unsigned long flags; + + local_irq_save(flags); + sched_clock_idle_sleep_event(); + + if (khz) + __set_cyc2ns_scale(khz, cpu, tsc_now); -done: sched_clock_idle_wakeup_event(); local_irq_restore(flags); } > +/* + * Initialize cyc2ns for boot cpu + */ +static void __init cyc2ns_init_boot_cpu(void) +{ + struct cyc2ns *c2n = this_cpu_ptr(&cyc2ns); + + seqcount_init(&c2n->seq); + __set_cyc2ns_scale(tsc_khz, smp_processor_id(), rdtsc()); +} Hi Paval, Here we didn't protect it with local_irq_save()/local_irq_restore() + +/* + * Secondary CPUs do not run through cyc2ns_init(), so set up + * all the scale factors for all CPUs, assuming the same + * speed as the bootup CPU. 
(cpufreq notifiers will fix this + * up if their speed diverges) + */ +static void __init cyc2ns_init_secondary_cpus(void) +{ + unsigned int cpu, this_cpu = smp_processor_id(); + struct cyc2ns *c2n = this_cpu_ptr(&cyc2ns); + struct cyc2ns_data *data = c2n->data; + + for_each_possible_cpu(cpu) { + if (cpu != this_cpu) { + seqcount_init(&c2n->seq); + c2n = per_cpu_ptr(&cyc2ns, cpu); + c2n->data[0] = data[0]; + c2n->data[1] = data[1]; + } + } +} + /* * Scheduler clock - returns current time in nanosec units. */ @@ -1385,6 +1404,10 @@ void __init tsc_early_init(void) if (!determine_cpu_tsc_frequncies()) return; loops_per_jiffy = get_loops_per_jiffy(); + + /* Sanitize TSC ADJUST before cyc2ns gets initialized */ + tsc_store_and_check_tsc_adjust(true); + cyc2ns_init_boot_cpu(); } void __init tsc_init(void) @@ -1400,23 +1423,12 @@ void __init tsc_init(void) mark_tsc_unstable("could not calculate TSC khz"); return; } + /* Sanitize TSC ADJUST before cyc2ns gets initialized */ + tsc_store_and_check_tsc_adjust(true); + cyc2ns_init_boot_cpu(); In tsc_init(), the timer interrupt has worked, Is that OK? IMO, no need to change so much, just use the original code, eg: +static void __init cyc2ns_
Re: [PATCH v13 13/18] x86/tsc: calibrate tsc only once
At 07/12/2018 08:04 AM, Pavel Tatashin wrote: During boot tsc is calibrated twice: once in tsc_early_delay_calibrate(), and the second time in tsc_init(). Rename tsc_early_delay_calibrate() to tsc_early_init(), and rework it so the calibration is done only early, and make tsc_init() to use the values already determined in tsc_early_init(). Sometimes it is not possible to determine tsc early, as the subsystem that is required is not yet initialized, in such case try again later in tsc_init(). Suggested-by: Thomas Gleixner Signed-off-by: Pavel Tatashin Hi Pavel, Aha, a complex solution for a simple problem! ;-) And I did find any benefits of doing that. did I miss something? As the cpu_khz and tsc_khz are global variables and the tsc_khz may be reset to cpu_khz. How about the following patch. Thanks, dou 8<--- diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index 74392d9d51e0..e54fa1037d45 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -1370,8 +1370,10 @@ void __init tsc_init(void) return; } - cpu_khz = x86_platform.calibrate_cpu(); - tsc_khz = x86_platform.calibrate_tsc(); + if (!tsc_khz) { + cpu_khz = x86_platform.calibrate_cpu(); + tsc_khz = x86_platform.calibrate_tsc(); + } /* * Trust non-zero tsc_khz as authorative, --- arch/x86/include/asm/tsc.h | 2 +- arch/x86/kernel/setup.c| 2 +- arch/x86/kernel/tsc.c | 86 -- 3 files changed, 48 insertions(+), 42 deletions(-) diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h index 2701d221583a..c4368ff73652 100644 --- a/arch/x86/include/asm/tsc.h +++ b/arch/x86/include/asm/tsc.h @@ -33,7 +33,7 @@ static inline cycles_t get_cycles(void) extern struct system_counterval_t convert_art_to_tsc(u64 art); extern struct system_counterval_t convert_art_ns_to_tsc(u64 art_ns); -extern void tsc_early_delay_calibrate(void); +extern void tsc_early_init(void); extern void tsc_init(void); extern void mark_tsc_unstable(char *reason); extern int unsynchronized_tsc(void); diff --git 
a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 01fcc8bf7c8f..07445482bb57 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -1014,7 +1014,7 @@ void __init setup_arch(char **cmdline_p) */ init_hypervisor_platform(); - tsc_early_delay_calibrate(); + tsc_early_init(); x86_init.resources.probe_roms(); diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index 186395041725..bc8eb82050a3 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -33,6 +33,8 @@ EXPORT_SYMBOL(cpu_khz); unsigned int __read_mostly tsc_khz; EXPORT_SYMBOL(tsc_khz); +#define KHZ 1000 + /* * TSC can be unstable due to cpufreq or due to unsynced TSCs */ @@ -1335,34 +1337,10 @@ static int __init init_tsc_clocksource(void) */ device_initcall(init_tsc_clocksource); -void __init tsc_early_delay_calibrate(void) -{ - unsigned long lpj; - - if (!boot_cpu_has(X86_FEATURE_TSC)) - return; - - cpu_khz = x86_platform.calibrate_cpu(); - tsc_khz = x86_platform.calibrate_tsc(); - - tsc_khz = tsc_khz ? 
: cpu_khz; - if (!tsc_khz) - return; - - lpj = tsc_khz * 1000; - do_div(lpj, HZ); - loops_per_jiffy = lpj; -} - -void __init tsc_init(void) +static bool determine_cpu_tsc_frequncies(void) { - u64 lpj, cyc; - int cpu; - - if (!boot_cpu_has(X86_FEATURE_TSC)) { - setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER); - return; - } + /* Make sure that cpu and tsc are not already calibrated */ + WARN_ON(cpu_khz || tsc_khz); cpu_khz = x86_platform.calibrate_cpu(); tsc_khz = x86_platform.calibrate_tsc(); @@ -1377,20 +1355,51 @@ void __init tsc_init(void) else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz) cpu_khz = tsc_khz; - if (!tsc_khz) { - mark_tsc_unstable("could not calculate TSC khz"); - setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER); - return; - } + if (tsc_khz == 0) + return false; pr_info("Detected %lu.%03lu MHz processor\n", - (unsigned long)cpu_khz / 1000, - (unsigned long)cpu_khz % 1000); + (unsigned long)cpu_khz / KHZ, + (unsigned long)cpu_khz % KHZ); if (cpu_khz != tsc_khz) { pr_info("Detected %lu.%03lu MHz TSC", - (unsigned long)tsc_khz / 1000, - (unsigned long)tsc_khz % 1000); + (unsigned long)tsc_khz / KHZ, + (unsigned long)tsc_khz % KHZ); + } + return true; +} + +static unsigned long get_loops_per_jiffy(void) +{ +
Re: Bug report about KASLR and ZONE_MOVABLE
Hi Baoquan, At 07/11/2018 08:40 PM, Baoquan He wrote: Please try this v3 patch: >>From 9850d3de9c02e570dc7572069a9749a8add4c4c7 Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Wed, 11 Jul 2018 20:31:51 +0800 Subject: [PATCH v3] mm, page_alloc: find movable zone after kernel text In find_zone_movable_pfns_for_nodes(), when try to find the starting PFN movable zone begins in each node, kernel text position is not considered. KASLR may put kernel after which movable zone begins. Fix it by finding movable zone after kernel text on that node. Signed-off-by: Baoquan He You fix this in the _zone_init side_. This may make the 'kernelcore=' or 'movablecore=' failed if the KASLR puts the kernel back the tail of the last node, or more. Due to we have fix the mirror memory in KASLR side, and Chao is trying to fix the 'movable_node' in KASLR side. Have you had a chance to fix this in the KASLR side. --- mm/page_alloc.c | 20 +++- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1521100..390eb35 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6547,7 +6547,7 @@ static unsigned long __init early_calculate_totalpages(void) static void __init find_zone_movable_pfns_for_nodes(void) { int i, nid; - unsigned long usable_startpfn; + unsigned long usable_startpfn, real_startpfn; unsigned long kernelcore_node, kernelcore_remaining; /* save the state before borrow the nodemask */ nodemask_t saved_node_state = node_states[N_MEMORY]; @@ -6681,10 +6681,20 @@ static void __init find_zone_movable_pfns_for_nodes(void) if (start_pfn >= end_pfn) continue; + /* +* KASLR may put kernel near tail of node memory, +* start after kernel on that node to find PFN +* which zone begins. +*/ + if (pfn_to_nid(PFN_UP(_etext)) == i) Here, did you want to check the Node id? seems may be nid. and for_each_node_state(nid, N_MEMORY) { ... seems check here is more suitable. 
for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) { } } Thanks, dou + real_startpfn = max(usable_startpfn, + PFN_UP(_etext)) + else + real_startpfn = usable_startpfn; /* Account for what is only usable for kernelcore */ - if (start_pfn < usable_startpfn) { + if (start_pfn < real_startpfn) { unsigned long kernel_pages; - kernel_pages = min(end_pfn, usable_startpfn) + kernel_pages = min(end_pfn, real_startpfn) - start_pfn; kernelcore_remaining -= min(kernel_pages, @@ -6693,7 +6703,7 @@ static void __init find_zone_movable_pfns_for_nodes(void) required_kernelcore); /* Continue if range is now fully accounted */ - if (end_pfn <= usable_startpfn) { + if (end_pfn <= real_startpfn) { /* * Push zone_movable_pfn to the end so @@ -6704,7 +6714,7 @@ static void __init find_zone_movable_pfns_for_nodes(void) zone_movable_pfn[nid] = end_pfn; continue; } - start_pfn = usable_startpfn; + start_pfn = real_startpfn; } /*
[tip:x86/urgent] x86/vector: Fix the args of vector_alloc tracepoint
Commit-ID: 838d76d63ec4eaeaa12bedfa50f261480f615200 Gitweb: https://git.kernel.org/tip/838d76d63ec4eaeaa12bedfa50f261480f615200 Author: Dou Liyang AuthorDate: Fri, 1 Jun 2018 14:50:31 +0800 Committer: Thomas Gleixner CommitDate: Wed, 6 Jun 2018 13:38:02 +0200 x86/vector: Fix the args of vector_alloc tracepoint The vector_alloc tracepoint reversed the reserved and ret args, which made the trace print wrong. Exchange them. Fixes: 8d1e3dca7de6 ("x86/vector: Add tracepoints for vector management") Signed-off-by: Dou Liyang Signed-off-by: Thomas Gleixner Cc: h...@zytor.com Cc: sta...@vger.kernel.org Link: https://lkml.kernel.org/r/20180601065031.21872-1-douly.f...@cn.fujitsu.com --- arch/x86/include/asm/trace/irq_vectors.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/include/asm/trace/irq_vectors.h b/arch/x86/include/asm/trace/irq_vectors.h index 22647a642e98..0af81b590a0c 100644 --- a/arch/x86/include/asm/trace/irq_vectors.h +++ b/arch/x86/include/asm/trace/irq_vectors.h @@ -236,7 +236,7 @@ TRACE_EVENT(vector_alloc, TP_PROTO(unsigned int irq, unsigned int vector, bool reserved, int ret), - TP_ARGS(irq, vector, ret, reserved), + TP_ARGS(irq, vector, reserved, ret), TP_STRUCT__entry( __field(unsigned int, irq )
[tip:x86/urgent] x86/idt: Simplify the idt_setup_apic_and_irq_gates()
Commit-ID: 336628128826a9acb045571a960e32e4414ccb61 Gitweb: https://git.kernel.org/tip/336628128826a9acb045571a960e32e4414ccb61 Author: Dou Liyang AuthorDate: Wed, 23 May 2018 10:35:55 +0800 Committer: Thomas Gleixner CommitDate: Wed, 6 Jun 2018 13:38:01 +0200 x86/idt: Simplify the idt_setup_apic_and_irq_gates() The idt_setup_apic_and_irq_gates() function first sets the gates from FIRST_EXTERNAL_VECTOR up to FIRST_SYSTEM_VECTOR. Then, from FIRST_SYSTEM_VECTOR to NR_VECTORS, it takes both APIC=y and APIC=n into account. But for APIC=n, FIRST_SYSTEM_VECTOR is equal to NR_VECTORS, so all vectors have already been set in the first step. Simplify the second step; make it work only for APIC=y. Signed-off-by: Dou Liyang Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20180523023555.2933-1-douly.f...@cn.fujitsu.com --- arch/x86/kernel/idt.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c index 2c3a1b4294eb..74383a3780dc 100644 --- a/arch/x86/kernel/idt.c +++ b/arch/x86/kernel/idt.c @@ -317,15 +317,12 @@ void __init idt_setup_apic_and_irq_gates(void) set_intr_gate(i, entry); } - for_each_clear_bit_from(i, system_vectors, NR_VECTORS) { #ifdef CONFIG_X86_LOCAL_APIC + for_each_clear_bit_from(i, system_vectors, NR_VECTORS) { set_bit(i, system_vectors); set_intr_gate(i, spurious_interrupt); -#else - entry = irq_entries_start + 8 * (i - FIRST_EXTERNAL_VECTOR); - set_intr_gate(i, entry); -#endif } +#endif } /**
Re: [patch 3/8] x86/apic: Provide apic_ack_irq()
Hi Thomas, At 06/06/2018 04:04 PM, Thomas Gleixner wrote: On Wed, 6 Jun 2018, Dou Liyang wrote: Hi Thomas, At 06/05/2018 07:41 PM, Thomas Gleixner wrote: On Tue, 5 Jun 2018, Dou Liyang wrote: +{ + if (unlikely(irqd_is_setaffinity_pending(irqd))) Affinity pending is also judged in + irq_move_irq(irqd); If we can remove the if(...) statement here That requires to fix all call sites in ia64 and that's why I didn't. But I didn't express myself clearly: I meant removing the if(...) statement from apic_ack_irq(); it doesn't require fixing the call sites in ia64. I put the check there on purpose as I explained in the changelog: Making the invocation of irq_move_irq() conditional avoids the out of line call if the pending bit is not set. I completely understand now, thank you so much. :-) Thanks, dou Thanks, tglx
Re: [patch 3/8] x86/apic: Provide apic_ack_irq()
Hi Thomas, At 06/05/2018 07:41 PM, Thomas Gleixner wrote: On Tue, 5 Jun 2018, Dou Liyang wrote: +{ + if (unlikely(irqd_is_setaffinity_pending(irqd))) Affinity pending is also judged in + irq_move_irq(irqd); If we can remove the if(...) statement here That requires to fix all call sites in ia64 and that's why I didn't. But I didn't express myself clearly: I meant removing the if(...) statement from apic_ack_irq(); it doesn't require fixing the call sites in ia64. +void apic_ack_irq(struct irq_data *irqd) +{ + irq_move_irq(irqd); + ack_APIC_irq(); +} BTW, can apic_ack_irq() accept _any_ irq_data when hierarchical irqdomains are enabled[1]? If so, there is a situation in the original code that we should avoid: if the top-level irq_data has the IRQD_SETAFFINITY_PENDING state but a non-top-level irq_data does not, then when a non-top-level irq_data is used in apic_ack_irq(), we may skip the irq_move_irq() call that we should make. [1] commit 77ed42f18edd ("genirq: Prevent crash in irq_move_irq()") we can make irq_move_irq() an inline function and have the check in the inline. I don't know why we need to make irq_move_irq() an inline function. And, yes, irq_move_irq() already has the check ... if (likely(!irqd_is_setaffinity_pending(idata))) return; ... Thanks, dou
Re: [patch 3/8] x86/apic: Provide apic_ack_irq()
Hi Thomas, At 06/04/2018 11:33 PM, Thomas Gleixner wrote: apic_ack_edge() is explicitely for handling interrupt affinity cleanup when interrupt remapping is not available or disable. Remapped interrupts and also some of the platform specific special interrupts, e.g. UV, invoke ack_APIC_irq() directly. To address the issue of failing an affinity update with -EBUSY the delayed affinity mechanism can be reused, but ack_APIC_irq() does not handle that. Adding this to ack_APIC_irq() is not possible, because that function is also used for exceptions and directly handled interrupts like IPIs. Create a new function, which just contains the conditional invocation of irq_move_irq() and the final ack_APIC_irq(). Making the invocation of irq_move_irq() conditional avoids the out of line call if the pending bit is not set. Reuse the new function in apic_ack_edge(). Preparatory change for the real fix Fixes: dccfe3147b42 ("x86/vector: Simplify vector move cleanup") Signed-off-by: Thomas Gleixner Cc:sta...@vger.kernel.org --- arch/x86/include/asm/apic.h |2 ++ arch/x86/kernel/apic/vector.c | 10 -- 2 files changed, 10 insertions(+), 2 deletions(-) --- a/arch/x86/include/asm/apic.h +++ b/arch/x86/include/asm/apic.h @@ -436,6 +436,8 @@ static inline void apic_set_eoi_write(vo #endif /* CONFIG_X86_LOCAL_APIC */ +extern void apic_ack_irq(struct irq_data *data); + static inline void ack_APIC_irq(void) { /* --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -809,11 +809,17 @@ static int apic_retrigger_irq(struct irq return 1; } +void apic_ack_irq(struct irq_data *irqd) +{ + if (unlikely(irqd_is_setaffinity_pending(irqd))) Affinity pending is also judged in + irq_move_irq(irqd); If we can remove the if(...) statement here Thanks, dou + ack_APIC_irq(); +} + void apic_ack_edge(struct irq_data *irqd) { irq_complete_move(irqd_cfg(irqd)); - irq_move_irq(irqd); - ack_APIC_irq(); + apic_ack_irq(irqd); } static struct irq_chip lapic_controller = {
Re: [patch 2/8] genirq/generic_pending: Do not lose pending affinity update
Hi Thomas, At 06/04/2018 11:33 PM, Thomas Gleixner wrote: The generic pending interrupt mechanism moves interrupts from the interrupt handler on the original target CPU to the new destination CPU. This is required for x86 and ia64 due to the way the interrupt delivery and acknowledge works if the interrupts are not remapped. However that update can fail for various reasons. Some of them are valid reasons to discard the pending update, but the case, when the previous move has not been fully cleaned up is not a legit reason to fail. Check the return value of irq_do_set_affinity() for -EBUSY, which indicates a pending cleanup, and rearm the pending move in the irq dexcriptor so it's s/dexcriptor/descriptor tried again when the next interrupt arrives. Fixes: 996c591227d9 ("x86/irq: Plug vector cleanup race") Signed-off-by: Thomas Gleixner Cc:sta...@vger.kernel.org --- kernel/irq/migration.c | 24 ++-- 1 file changed, 18 insertions(+), 6 deletions(-) --- a/kernel/irq/migration.c +++ b/kernel/irq/migration.c @@ -38,17 +38,18 @@ bool irq_fixup_move_pending(struct irq_d void irq_move_masked_irq(struct irq_data *idata) { struct irq_desc *desc = irq_data_to_desc(idata); - struct irq_chip *chip = desc->irq_data.chip; + struct irq_data *data = &desc->irq_data; + struct irq_chip *chip = data->chip; - if (likely(!irqd_is_setaffinity_pending(&desc->irq_data))) + if (likely(!irqd_is_setaffinity_pending(data))) return; - irqd_clr_move_pending(&desc->irq_data); + irqd_clr_move_pending(data); /* * Paranoia: cpu-local interrupts shouldn't be calling in here anyway. */ - if (irqd_is_per_cpu(&desc->irq_data)) { + if (irqd_is_per_cpu(data)) { WARN_ON(1); return; } @@ -73,9 +74,20 @@ void irq_move_masked_irq(struct irq_data * For correct operation this depends on the caller * masking the irqs. 
*/ - if (cpumask_any_and(desc->pending_mask, cpu_online_mask) < nr_cpu_ids) - irq_do_set_affinity(&desc->irq_data, desc->pending_mask, false); + if (cpumask_any_and(desc->pending_mask, cpu_online_mask) < nr_cpu_ids) { + int ret; + ret = irq_do_set_affinity(data, desc->pending_mask, false); + /* +* If the there is a cleanup pending in the underlying s/If the there is/If there is/ Thanks, dou
Re: WARNING and PANIC in irq_matrix_free
Hi Thomas, At 06/04/2018 07:17 PM, Thomas Gleixner wrote: On Mon, 4 Jun 2018, Dou Liyang wrote: Here, why didn't we avoid this cleanup by diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c index a75de0792942..0cc59646755f 100644 --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -821,6 +821,9 @@ static void free_moved_vector(struct apic_chip_data *apicd) */ WARN_ON_ONCE(managed); + if (!vector) + return; + trace_vector_free_moved(apicd->irq, cpu, vector, managed); irq_matrix_free(vector_matrix, cpu, vector, managed); per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED; Is there something I didn't consider? ;-) Well, that just prevents the warning, but the hlist is already corrupted. So you'd just cure the symptom ... I see. I'm about to send a patch series which addresses that. Just need to finish writing changelogs. Thank you for telling me that. Thanks, dou
Re: WARNING and PANIC in irq_matrix_free
Hi Thomas, Sorry to ask the questions at this series, my mailbox was kicked out of the mailing list a few days ago, and didn't receive the e-mail. please see below At 05/29/2018 04:09 AM, Thomas Gleixner wrote: On Mon, 28 May 2018, Song Liu wrote: This doesn't fix the issue with bnxt. Here is a trace with this patch: [...] Thanks for providing the data! > 19d... 1610359248us : vector_deactivate: irq=31 is_managed=0 can_reserve=1 reserve=0 > 19d... 1610359248us : vector_clear: irq=31 vector=33 cpu=20 prev_vector=0 prev_cpu=2 > 19d... 1610359249us : irq_matrix_free: bit=33 cpu=20 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11241, global_rsvd=25, total_alloc=15 Here IRQ 31 is shutdown and the vector freed. > 19d... 1610359249us : irq_matrix_reserve: online_maps=56 global_avl=11241, global_rsvd=26, total_alloc=15 > 19d... 1610359249us : vector_reserve: irq=31 ret=0 > 19d... 1610359249us : vector_config: irq=31 vector=239 cpu=0 apicdest=0x And set to the magic reservation vector 239 to catch spurious interrupts. > 20dN.. 1610366654us : vector_activate: irq=31 is_managed=0 can_reserve=1 reserve=0 > 20dN.. 1610366654us : vector_alloc: irq=31 vector=4294967268 reserved=1 ret=0 > 20dN.. 1610366655us : irq_matrix_alloc: bit=33 cpu=9 online=1 avl=200 alloc=1 managed=0 online_maps=56 global_avl=11240, global_rsvd=28, total_alloc=16 > 20dN.. 1610366655us : vector_update: irq=31 vector=33 cpu=9 prev_vector=0 prev_cpu=20 this means there is no associated previous vector. > 20dN.. 1610366656us : vector_alloc: irq=31 vector=33 reserved=1 ret=0 > 20dN.. 1610366656us : vector_config: irq=31 vector=33 cpu=9 apicdest=0x0014 So here it gets initialized again and targets CPU9 now. > 20dN.. 161032us : irq_matrix_alloc: bit=33 cpu=20 online=1 avl=200 alloc=1 managed=0 online_maps=56 global_avl=11240, global_rsvd=28, total_alloc=16 > 20dN.. 161032us : vector_update: irq=31 vector=33 cpu=20 prev_vector=33 prev_cpu=9 > 20dN.. 
161032us : vector_alloc: irq=31 vector=33 reserved=1 ret=0 > 20dN.. 161032us : vector_config: irq=31 vector=33 cpu=20 apicdest=0x002c Here it is reconfigured to CPU 20. Now that update schedules vector 33 on CPU9 for cleanup. > 20dN.. 161036us : irq_matrix_alloc: bit=34 cpu=2 online=1 avl=199 alloc=2 managed=0 online_maps=56 global_avl=11239, global_rsvd=28, total_alloc=17 > 20dN.. 161036us : vector_update: irq=31 vector=34 cpu=2 prev_vector=33 prev_cpu=20 > 20dN.. 161036us : vector_alloc: irq=31 vector=34 reserved=1 ret=0 > 20dN.. 161036us : vector_config: irq=31 vector=34 cpu=2 apicdest=0x0004 So here the shit hits the fan because that update schedules vector 33 on CPU20 for cleanup while the previous cleanup for CPU9 has not been done yet. Cute. or not so cute > 20dNh. 161039us : vector_free_moved: irq=31 cpu=20 vector=33 is_managed=0 > 20dNh. 1610366670us : irq_matrix_free: bit=33 cpu=20 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11240, global_rsvd=28, total_alloc=16 And frees the CPU 20 vector > 9d.h. 1610366696us : vector_free_moved: irq=31 cpu=20 vector=0 is_managed=0 Here, why didn't we avoid this cleanup by diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c index a75de0792942..0cc59646755f 100644 --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -821,6 +821,9 @@ static void free_moved_vector(struct apic_chip_data *apicd) */ WARN_ON_ONCE(managed); + if (!vector) + return; + trace_vector_free_moved(apicd->irq, cpu, vector, managed); irq_matrix_free(vector_matrix, cpu, vector, managed); per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED; Is there something I didn't consider with? ;-) Thanks, dou. And then CPU9 claims that it's queued for cleanup. Bah. I'm still working on a fix as the elegant solution refused to work. Thanks, tglx
[PATCH] x86/vector: Fix the args of vector_alloc tracepoint
The vector_alloc tracepoint reversed the reserved and ret args, which made the trace print wrong. Exchange them. Signed-off-by: Dou Liyang --- arch/x86/include/asm/trace/irq_vectors.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/include/asm/trace/irq_vectors.h b/arch/x86/include/asm/trace/irq_vectors.h index 22647a642e98..0af81b590a0c 100644 --- a/arch/x86/include/asm/trace/irq_vectors.h +++ b/arch/x86/include/asm/trace/irq_vectors.h @@ -236,7 +236,7 @@ TRACE_EVENT(vector_alloc, TP_PROTO(unsigned int irq, unsigned int vector, bool reserved, int ret), - TP_ARGS(irq, vector, ret, reserved), + TP_ARGS(irq, vector, reserved, ret), TP_STRUCT__entry( __field(unsigned int, irq ) -- 2.14.3
[RESEND RFC PATCH] acpi/processor: Remove the check of duplicate processors
We found that some processors in the ACPI namespace have the same processor ID but belong to different PXMs, such as:

proc_id | pxm
   0   <->  0
   1   <->  0
   2   <->  1
   3   <->  1
   ..
   89  <->  0
   89  <->  1
   89  <->  2
   89  <->  3
   ..

So a mechanism was created to validate them and mark the processor (ID=89) as invalid, and once a processor is physically hotplugged, its processor id is checked:

Commit 8e089eaa1999 ("acpi: Provide mechanism to validate processors in the ACPI tables")
Commit a77d6cd96849 ("acpi/processor: Check for duplicate processor ids at hotplug time")

Recently, I found that the check mechanism has a bug: it doesn't use acpi_processor_ids_walk() correctly and always gives a wrong result. First, I fixed it by replacing the return value with AE_OK, which is the standard acpi_status value. https://lkml.org/lkml/2018/3/20/273

But now I think this check is useless. My reasons are the following:

1) Based on practical tests, it works well and no bug has been reported.
2) Based on the code, the duplicate cases can be dealt with by

     if (invalid_logical_cpuid(pr->id) || !cpu_present(pr->id))

That seems more reasonable; consider the following case:

                                          Before the patch | After the patch
the first processor (ID=89) hot-add       failed           | success
the other processors (ID=89) hot-add      failed           | failed

So, remove the check code.
Signed-off-by: Dou Liyang Signed-off-by: Dou Liyang --- drivers/acpi/acpi_processor.c | 126 -- include/linux/acpi.h | 3 - 2 files changed, 129 deletions(-) diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c index 449d86d39965..8358708e0bbb 100644 --- a/drivers/acpi/acpi_processor.c +++ b/drivers/acpi/acpi_processor.c @@ -281,13 +281,6 @@ static int acpi_processor_get_info(struct acpi_device *device) pr->acpi_id = value; } - if (acpi_duplicate_processor_id(pr->acpi_id)) { - dev_err(&device->dev, - "Failed to get unique processor _UID (0x%x)\n", - pr->acpi_id); - return -ENODEV; - } - pr->phys_id = acpi_get_phys_id(pr->handle, device_declaration, pr->acpi_id); if (invalid_phys_cpuid(pr->phys_id)) @@ -579,127 +572,8 @@ static struct acpi_scan_handler processor_container_handler = { .attach = acpi_processor_container_attach, }; -/* The number of the unique processor IDs */ -static int nr_unique_ids __initdata; - -/* The number of the duplicate processor IDs */ -static int nr_duplicate_ids; - -/* Used to store the unique processor IDs */ -static int unique_processor_ids[] __initdata = { - [0 ... NR_CPUS - 1] = -1, -}; - -/* Used to store the duplicate processor IDs */ -static int duplicate_processor_ids[] = { - [0 ... NR_CPUS - 1] = -1, -}; - -static void __init processor_validated_ids_update(int proc_id) -{ - int i; - - if (nr_unique_ids == NR_CPUS||nr_duplicate_ids == NR_CPUS) - return; - - /* -* Firstly, compare the proc_id with duplicate IDs, if the proc_id is -* already in the IDs, do nothing. -*/ - for (i = 0; i < nr_duplicate_ids; i++) { - if (duplicate_processor_ids[i] == proc_id) - return; - } - - /* -* Secondly, compare the proc_id with unique IDs, if the proc_id is in -* the IDs, put it in the duplicate IDs. 
-*/ - for (i = 0; i < nr_unique_ids; i++) { - if (unique_processor_ids[i] == proc_id) { - duplicate_processor_ids[nr_duplicate_ids] = proc_id; - nr_duplicate_ids++; - return; - } - } - - /* -* Lastly, the proc_id is a unique ID, put it in the unique IDs. -*/ - unique_processor_ids[nr_unique_ids] = proc_id; - nr_unique_ids++; -} - -static acpi_status __init acpi_processor_ids_walk(acpi_handle handle, - u32 lvl, - void *context, - void **rv) -{ - acpi_status status; - acpi_object_type acpi_type; - unsigned long long uid; - union acpi_object object = { 0 }; - struct acpi_buffer buffer = { sizeof(union acpi_object), &object }; - - status = acpi_get_type(handle, &acpi_type); - if (ACPI_FAILURE(status)) - return false; - - switch (acpi_type) { - case
Re: [PATCH 0/1] Remove the check of duplicate processors
Hi Rafael, At 05/28/2018 04:40 PM, Rafael J. Wysocki wrote: Can you please resend the patch with the above information in the changelog of it? Yes, I will resend the patch right now. That would make it easier to review and discuss it. Also I think that we need some sort of a check against duplicate IDs. Thanks, dou
[PATCH v2] x86/idt: Simplify the idt_setup_apic_and_irq_gates()
The idt_setup_apic_and_irq_gates() function first sets the gates from FIRST_EXTERNAL_VECTOR up to FIRST_SYSTEM_VECTOR. Then, from FIRST_SYSTEM_VECTOR to NR_VECTORS, it takes both APIC=y and APIC=n into account. But for APIC=n, FIRST_SYSTEM_VECTOR is equal to NR_VECTORS, so all vectors have already been set in the first step. Simplify the second step; make it work only for APIC=y. Signed-off-by: Dou Liyang --- arch/x86/kernel/idt.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c index 2c3a1b4294eb..74383a3780dc 100644 --- a/arch/x86/kernel/idt.c +++ b/arch/x86/kernel/idt.c @@ -317,15 +317,12 @@ void __init idt_setup_apic_and_irq_gates(void) set_intr_gate(i, entry); } - for_each_clear_bit_from(i, system_vectors, NR_VECTORS) { #ifdef CONFIG_X86_LOCAL_APIC + for_each_clear_bit_from(i, system_vectors, NR_VECTORS) { set_bit(i, system_vectors); set_intr_gate(i, spurious_interrupt); -#else - entry = irq_entries_start + 8 * (i - FIRST_EXTERNAL_VECTOR); - set_intr_gate(i, entry); -#endif } +#endif } /** -- 2.14.3
Re: [PATCH 4/5] acpi/processor: Fix the return value of acpi_processor_ids_walk()
At 05/22/2018 09:47 AM, Dou Liyang wrote: At 05/19/2018 11:06 PM, Thomas Gleixner wrote: On Tue, 20 Mar 2018, Dou Liyang wrote: The ACPI driver should make sure all the processor IDs in the ACPI namespace are unique for CPU hotplug. The driver performs a depth-first walk of the namespace tree and calls acpi_processor_ids_walk(). But acpi_processor_ids_walk() returns true once one processor is checked, which causes the walk to stop after the first processor. Replace the value with AE_OK, which is the standard acpi_status value. Fixes: 8c8cb30f49b8 ("acpi/processor: Implement DEVICE operator for processor enumeration") Signed-off-by: Dou Liyang --- drivers/acpi/acpi_processor.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c index 449d86d39965..db5bdb59639c 100644 --- a/drivers/acpi/acpi_processor.c +++ b/drivers/acpi/acpi_processor.c @@ -663,11 +663,11 @@ static acpi_status __init (acpi_handle handle, } processor_validated_ids_update(uid); - return true; + return AE_OK; err: acpi_handle_info(handle, "Invalid processor object\n"); - return false; + return AE_OK; I'm not sure whether this is the right return value here. Rafael? +Cc Rafael's commonly used email address. I am sorry; I created the cc list using ./scripts/get_maintainer.pl ... and didn't check it. Thanks, dou Hi, Thomas, Rafael, Yes, I used AE_OK to make sure it skips the invalid objects and continues with the remaining objects; I'm also not sure. For this bug, I recently sent another patch that removes this check code entirely: https://lkml.org/lkml/2018/5/17/320 IMO, the duplicate IDs can be avoided by the other code if (invalid_logical_cpuid(pr->id) || !cpu_present(pr->id)) 1) As the mapping of cpu_id (pr->id) and processor_id is fixed, when hot-plugging a physical CPU, if its processor_id duplicates a present one, the above condition 1) will be 0, and Linux will not add this CPU.
Also, this code is executed every time the system starts, and it wastes more time as the number of CPUs increases. So I prefer to remove this code. Thanks, dou
Re: [PATCH] x86/idt: Simplify the idt_setup_apic_and_irq_gates()
Hi Thomas, At 05/19/2018 08:32 PM, Thomas Gleixner wrote: On Thu, 26 Apr 2018, Dou Liyang wrote: The vectors between FIRST_SYSTEM_VECTOR and NR_VECTORS are special IRQ vectors used by the SMP architecture. But, if X86_LOCAL_APIC=n, they will not be used, and FIRST_SYSTEM_VECTOR is equal to NR_VECTORS. Correct, but that function has nothing to do with FIRST_SYSTEM_VECTOR. Oops, sorry; when I reread it, even my own changelog misled me. The patch hid it. void __init idt_setup_apic_and_irq_gates(void) ... for_each_clear_bit_from(i, system_vectors, FIRST_SYSTEM_VECTOR) ^^^ ... for_each_clear_bit_from(i, system_vectors, NR_VECTORS) What I want to say is: on an APIC=n system, FIRST_SYSTEM_VECTOR is equal to NR_VECTORS, so all entries have already been set by for_each_clear_bit_from(i, system_vectors, FIRST_SYSTEM_VECTOR) ... the following setup code for APIC=n is redundant. Just set up gates for APIC=y. If that is OK, I will send v2. Thanks, dou diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c index 2c3a1b4294eb..8b4174890706 100644 --- a/arch/x86/kernel/idt.c +++ b/arch/x86/kernel/idt.c @@ -317,15 +317,16 @@ void __init idt_setup_apic_and_irq_gates(void) set_intr_gate(i, entry); } - for_each_clear_bit_from(i, system_vectors, NR_VECTORS) { + /* +* If X86_LOCAL_APIC=n, the FIRST_SYSTEM_VECTOR is equal to NR_VECTORS +* Just consider the X86_LOCAL_APIC=y case +*/ #ifdef CONFIG_X86_LOCAL_APIC + for_each_clear_bit_from(i, system_vectors, NR_VECTORS) { set_bit(i, system_vectors); set_intr_gate(i, spurious_interrupt); -#else - entry = irq_entries_start + 8 * (i - FIRST_EXTERNAL_VECTOR); - set_intr_gate(i, entry); -#endif } +#endif That change breaks the LOCAL_APIC=n case in a subtle way. What the function does is to set _ALL_ entries starting from FIRST_EXTERNAL_VECTOR up to NR_VECTORS in the IDT to a known target, except those which are already occupied by a system vector.
So for APIC=y this sets them to the spurious vector, and for APIC=n it sets them to the corresponding vector entry. You remove the latter... > Thanks, tglx
Re: [PATCH 4/5] acpi/processor: Fix the return value of acpi_processor_ids_walk()
At 05/19/2018 11:06 PM, Thomas Gleixner wrote: On Tue, 20 Mar 2018, Dou Liyang wrote: The ACPI driver should make sure all the processor IDs in the ACPI namespace are unique for CPU hotplug. The driver performs a depth-first walk of the namespace tree and calls acpi_processor_ids_walk(). But acpi_processor_ids_walk() returns true once one processor is checked, which causes the walk to stop after the first processor. Replace the value with AE_OK, which is the standard acpi_status value. Fixes: 8c8cb30f49b8 ("acpi/processor: Implement DEVICE operator for processor enumeration") Signed-off-by: Dou Liyang --- drivers/acpi/acpi_processor.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c index 449d86d39965..db5bdb59639c 100644 --- a/drivers/acpi/acpi_processor.c +++ b/drivers/acpi/acpi_processor.c @@ -663,11 +663,11 @@ static acpi_status __init (acpi_handle handle, } processor_validated_ids_update(uid); - return true; + return AE_OK; err: acpi_handle_info(handle, "Invalid processor object\n"); - return false; + return AE_OK; I'm not sure whether this is the right return value here. Rafael? Hi, Thomas, Rafael, Yes, I used AE_OK to make sure it skips the invalid objects and continues with the remaining objects; I'm also not sure. For this bug, I recently sent another patch that removes this check code entirely: https://lkml.org/lkml/2018/5/17/320 IMO, the duplicate IDs can be avoided by the other code if (invalid_logical_cpuid(pr->id) || !cpu_present(pr->id)) 1) As the mapping of cpu_id (pr->id) and processor_id is fixed, when hot-plugging a physical CPU, if its processor_id duplicates a present one, the above condition 1) will be 0, and Linux will not add this CPU. Also, this code is executed every time the system starts, and it wastes more time as the number of CPUs increases. So I prefer to remove this code. Thanks, dou
[tip:x86/apic] x86/vector: Merge allocate_vector() into assign_vector_locked()
Commit-ID: 2773397171ac4b6e794ba0b3e34c06cbaf29897a Gitweb: https://git.kernel.org/tip/2773397171ac4b6e794ba0b3e34c06cbaf29897a Author: Dou Liyang AuthorDate: Fri, 11 May 2018 16:09:56 +0800 Committer: Thomas Gleixner CommitDate: Sat, 19 May 2018 15:09:11 +0200 x86/vector: Merge allocate_vector() into assign_vector_locked() assign_vector_locked() calls allocate_vector() to get a real vector for an IRQ. If the current target CPU is online and in the new requested affinity mask, allocate_vector() will return 0 and nothing should be done. But, assign_vector_locked() calls apic_update_irq_cfg() even in that case which is pointless. allocate_vector() is not called from anything else, so the functions can be merged and in case of no change the apic_update_irq_cfg() can be avoided. [ tglx: Massaged changelog ] Signed-off-by: Dou Liyang Signed-off-by: Thomas Gleixner Cc: h...@zytor.com Link: https://lkml.kernel.org/r/20180511080956.6316-1-douly.f...@cn.fujitsu.com --- arch/x86/kernel/apic/vector.c | 17 - 1 file changed, 4 insertions(+), 13 deletions(-) diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c index bb6f7a2148d7..a75de0792942 100644 --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -218,7 +218,8 @@ static int reserve_irq_vector(struct irq_data *irqd) return 0; } -static int allocate_vector(struct irq_data *irqd, const struct cpumask *dest) +static int +assign_vector_locked(struct irq_data *irqd, const struct cpumask *dest) { struct apic_chip_data *apicd = apic_chip_data(irqd); bool resvd = apicd->has_reserved; @@ -236,22 +237,12 @@ static int allocate_vector(struct irq_data *irqd, const struct cpumask *dest) return 0; vector = irq_matrix_alloc(vector_matrix, dest, resvd, &cpu); - if (vector > 0) - apic_update_vector(irqd, vector, cpu); trace_vector_alloc(irqd->irq, vector, resvd, vector); - return vector; -} - -static int assign_vector_locked(struct irq_data *irqd, - const struct cpumask *dest) -{ - struct 
apic_chip_data *apicd = apic_chip_data(irqd); - int vector = allocate_vector(irqd, dest); - if (vector < 0) return vector; + apic_update_vector(irqd, vector, cpu); + apic_update_irq_cfg(irqd, vector, cpu); - apic_update_irq_cfg(irqd, apicd->vector, apicd->cpu); return 0; }
[RFC PATCH 1/1] acpi/processor: Remove the check of duplicate processors
Currently, Linux checks for duplicate processor ids at boot time based on the ACPI namespace, and uses the result at hotplug time. But in fact it doesn't work, because the walk over all APIC namespace entries is wrong, and even if it were right, it may miss a hotpluggable CPU (we can hotplug one of the duplicate processors; the others can be ignored). So, remove the check code. Signed-off-by: Dou Liyang --- drivers/acpi/acpi_processor.c | 126 -- include/linux/acpi.h | 3 - 2 files changed, 129 deletions(-) diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c index 449d86d39965..8358708e0bbb 100644 --- a/drivers/acpi/acpi_processor.c +++ b/drivers/acpi/acpi_processor.c @@ -281,13 +281,6 @@ static int acpi_processor_get_info(struct acpi_device *device) pr->acpi_id = value; } - if (acpi_duplicate_processor_id(pr->acpi_id)) { - dev_err(&device->dev, - "Failed to get unique processor _UID (0x%x)\n", - pr->acpi_id); - return -ENODEV; - } - pr->phys_id = acpi_get_phys_id(pr->handle, device_declaration, pr->acpi_id); if (invalid_phys_cpuid(pr->phys_id)) @@ -579,127 +572,8 @@ static struct acpi_scan_handler processor_container_handler = { .attach = acpi_processor_container_attach, }; -/* The number of the unique processor IDs */ -static int nr_unique_ids __initdata; - -/* The number of the duplicate processor IDs */ -static int nr_duplicate_ids; - -/* Used to store the unique processor IDs */ -static int unique_processor_ids[] __initdata = { - [0 ... NR_CPUS - 1] = -1, -}; - -/* Used to store the duplicate processor IDs */ -static int duplicate_processor_ids[] = { - [0 ... NR_CPUS - 1] = -1, -}; - -static void __init processor_validated_ids_update(int proc_id) -{ - int i; - - if (nr_unique_ids == NR_CPUS||nr_duplicate_ids == NR_CPUS) - return; - - /* -* Firstly, compare the proc_id with duplicate IDs, if the proc_id is -* already in the IDs, do nothing.
-*/ - for (i = 0; i < nr_duplicate_ids; i++) { - if (duplicate_processor_ids[i] == proc_id) - return; - } - - /* -* Secondly, compare the proc_id with unique IDs, if the proc_id is in -* the IDs, put it in the duplicate IDs. -*/ - for (i = 0; i < nr_unique_ids; i++) { - if (unique_processor_ids[i] == proc_id) { - duplicate_processor_ids[nr_duplicate_ids] = proc_id; - nr_duplicate_ids++; - return; - } - } - - /* -* Lastly, the proc_id is a unique ID, put it in the unique IDs. -*/ - unique_processor_ids[nr_unique_ids] = proc_id; - nr_unique_ids++; -} - -static acpi_status __init acpi_processor_ids_walk(acpi_handle handle, - u32 lvl, - void *context, - void **rv) -{ - acpi_status status; - acpi_object_type acpi_type; - unsigned long long uid; - union acpi_object object = { 0 }; - struct acpi_buffer buffer = { sizeof(union acpi_object), &object }; - - status = acpi_get_type(handle, &acpi_type); - if (ACPI_FAILURE(status)) - return false; - - switch (acpi_type) { - case ACPI_TYPE_PROCESSOR: - status = acpi_evaluate_object(handle, NULL, NULL, &buffer); - if (ACPI_FAILURE(status)) - goto err; - uid = object.processor.proc_id; - break; - - case ACPI_TYPE_DEVICE: - status = acpi_evaluate_integer(handle, "_UID", NULL, &uid); - if (ACPI_FAILURE(status)) - goto err; - break; - default: - goto err; - } - - processor_validated_ids_update(uid); - return true; - -err: - acpi_handle_info(handle, "Invalid processor object\n"); - return false; - -} - -static void __init acpi_processor_check_duplicates(void) -{ - /* check the correctness for all processors in ACPI namespace */ - acpi_walk_namespace(ACPI_TYPE_PROCESSOR, ACPI_ROOT_OBJECT, - ACPI_UINT32_MAX, - acpi_processor_ids_walk, - NULL, NULL, NULL); - acpi_get_devices(ACPI_PROCESSOR_DEVICE_HID, acpi_processor_ids_walk, - NULL, NULL); -} - -bool acpi_duplicate_processor_id(int proc_id) -{ - int i; - - /* -
[PATCH 0/1] Remove the check of duplicate processors
There is a bug in the CPU hotplug code. I have two simple fix methods, but I can't be sure which one is better, so I'm sending this out to seek help. We found some processors in the ACPI namespace which have the same processor ID but are in different PXMs, such as:

proc_id | pxm
   0   <->  0
   1   <->  0
   2   <->  1
   3   <->  1
  ..
  89   <->  0
  89   <->  1
  89   <->  2
  89   <->  3
  ..

So we created a mechanism to validate them, which marks the processor (ID=89) as invalid. And once a processor is hotplugged physically, we check its processor id:

Commit 8e089eaa1999 ("acpi: Provide mechanism to validate processors in the ACPI tables")
Commit a77d6cd96849 ("acpi/processor: Check for duplicate processor ids at hotplug time")

Recently, I found that the check mechanism has a bug: it doesn't use acpi_processor_ids_walk() correctly and always gives a wrong result. First, I fixed it by modifying the return value to AE_OK, which is the standard acpi_status value. https://lkml.org/lkml/2018/3/20/273 But now I even think this check is useless. My reasons are as follows:

1) Based on the practical effect: it works well, and no bug has been reported.
2) Based on the code: the duplicate cases can be dealt with by

   if (invalid_logical_cpuid(pr->id) || !cpu_present(pr->id))

That seems more reasonable. Let's look at the following case:

                                         Before the patch   After the patch
the first processor (ID=89) hot-add      failed             success
the other processors (ID=89) hot-add     failed             failed

So, what's your opinion about it?

Dou Liyang (1): acpi/processor: Remove the check of duplicate processor ids drivers/acpi/acpi_processor.c | 126 -- include/linux/acpi.h | 3 - 2 files changed, 129 deletions(-) -- 2.14.3
[PATCH] x86/vector: Merge the allocate_vector() into assign_vector_locked()
assign_vector_locked() calls allocate_vector() to get a real vector for an IRQ. If the current target CPU is online and in the new requested affinity mask, allocate_vector() will return 0 and nothing should be done. But assign_vector_locked() always calls apic_update_irq_cfg(), which is pointless in that case. Merge the two functions to avoid this situation and simplify the code. Signed-off-by: Dou Liyang --- arch/x86/kernel/apic/vector.c | 17 - 1 file changed, 4 insertions(+), 13 deletions(-) diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c index bb6f7a2148d7..a75de0792942 100644 --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -218,7 +218,8 @@ static int reserve_irq_vector(struct irq_data *irqd) return 0; } -static int allocate_vector(struct irq_data *irqd, const struct cpumask *dest) +static int +assign_vector_locked(struct irq_data *irqd, const struct cpumask *dest) { struct apic_chip_data *apicd = apic_chip_data(irqd); bool resvd = apicd->has_reserved; @@ -236,22 +237,12 @@ static int allocate_vector(struct irq_data *irqd, const struct cpumask *dest) return 0; vector = irq_matrix_alloc(vector_matrix, dest, resvd, &cpu); - if (vector > 0) - apic_update_vector(irqd, vector, cpu); trace_vector_alloc(irqd->irq, vector, resvd, vector); - return vector; -} - -static int assign_vector_locked(struct irq_data *irqd, - const struct cpumask *dest) -{ - struct apic_chip_data *apicd = apic_chip_data(irqd); - int vector = allocate_vector(irqd, dest); - if (vector < 0) return vector; + apic_update_vector(irqd, vector, cpu); + apic_update_irq_cfg(irqd, vector, cpu); - apic_update_irq_cfg(irqd, apicd->vector, apicd->cpu); return 0; } -- 2.14.3
Re: recent patch "x86/acpi: Prevent X2APIC id 0xffffffff from being accounted"
Hi Jan, At 05/02/2018 02:39 PM, Jan Beulich wrote: On 02.05.18 at 03:56, wrote: At 04/27/2018 08:09 PM, Jan Beulich wrote: I'm afraid I don't understand: Limiting the number of disabled CPUs is certainly desirable when those can never be used, but why would you want to do this when they might later get hotplugged? I'm not aware Let's see the workflow of CPU hotplug: 1) get the CPU device info from the ACPI namespace - it contains the logical processor id 2) through the logical processor id, get the LAPIC entry in the MADT 3) generate the CPU for the kernel (this will create a CPU id, visible via lscpu) Normally, there are no valid CPU devices in 1) which are mapped to the LAPIC entries (0xff or 0xffffffff). The actual number of hotplugged CPUs depends on the CPU devices/processors in the ACPI namespace. The number calculated from the MADT is the maximum situation, which can be cut and doesn't affect CPU hotplug. Don't worry about it. Now, on ACPI-based systems, Linux gets the number of possible CPUs from the MADT. We are going to use the ACPI namespace to make the number accurate, but it is very hard, because the ACPI namespace is initialized so late. So are you envisioning a model when the number of disabled CPUs can be increased once the ACPI interpreter has been enabled? Otherwise the Yes, good idea, but, as Thomas said: It needs to run _before_ setup_percpu() is invoked to scale everything correctly. maximum recorded during early boot may be too low with the changes in question. (And FTR, I agree this number may also be way too large without them, but it being too large is a resource efficiency problem, while it being too low is a functionality one.) A too-large number will waste vectors, and can even cause bugs. IMO, the number will just be more accurate by excluding 0xffffffff; it will only be equal to, or even larger than, the real number of possible CPUs.
Also, for background, besides wanting to clarify the correctness of these two changes, I'm also trying to understand whether we want to mirror them into the Xen hypervisor, which relies on the Dom0 kernel's ACPI interpreter (i.e. it can and does parse MADT, but can't and doesn't parse the ACPI name space). Hence late adjustment of the number of hotpluggable CPUs would be even more problematic in that environment. I am not familiar with Xen, I will consider that. Thanks, dou Jan
Re: recent patch "x86/acpi: Prevent X2APIC id 0xffffffff from being accounted"
At 04/27/2018 08:09 PM, Jan Beulich wrote: On 27.04.18 at 10:32, wrote: At 04/27/2018 03:21 PM, Jan Beulich wrote: I've just stumbled across this commit, and I'm wondering if that's actually correct (I too have at least one system where such IDs are reported in MADT): For offline/absent CPUs, the firmware may not know the APIC IDs The MADT table is not reasonable; the Local APIC entries (xAPIC/x2APIC) in the MADT are not always correct as the OS wants. So, we should do this sanity check. Of course, sanity checking is necessary. at the point MADT is built, so I think it is quite reasonable to put ~0 in there. The ACPI spec specifically calls out that the IDs must not change across sleep states, which implies to me that they may change across an offline period of a CPU. IOW I think such entries still need to contribute to the count of disabled CPUs. Aha, there are no CPU devices which will map their physical IDs to the illegal number 0xffffffff, so it will never be used. The ID will never be used, yes, but the CPU may be (after firmware has assigned it a valid ID). BTW, limiting the number of the disabled CPUs is also a goal. I'm afraid I don't understand: Limiting the number of disabled CPUs is certainly desirable when those can never be used, but why would you want to do this when they might later get hotplugged? I'm not aware Let's see the workflow of CPU hotplug: 1) get the CPU device info from the ACPI namespace - it contains the logical processor id 2) through the logical processor id, get the LAPIC entry in the MADT 3) generate the CPU for the kernel (this will create a CPU id, visible via lscpu) Normally, there are no valid CPU devices in 1) which are mapped to the LAPIC entries (0xff or 0xffffffff). The actual number of hotplugged CPUs depends on the CPU devices/processors in the ACPI namespace. The number calculated from the MADT is the maximum situation, which can be cut and doesn't affect CPU hotplug. Don't worry about it.
Now, on ACPI-based systems, Linux gets the number of possible CPUs from the MADT. We are going to use the ACPI namespace to make the number accurate, but it is very hard, because the ACPI namespace is initialized so late. Thanks, dou of a way to tell, by just looking at the MADT, which of the two cases apply. Jan
Re: recent patch "x86/acpi: Prevent X2APIC id 0xffffffff from being accounted"
Hi Jan, At 04/27/2018 03:21 PM, Jan Beulich wrote: Hello, I've just stumbled across this commit, and I'm wondering if that's actually correct (I too have at least one system where such IDs are reported in MADT): For offline/absent CPUs, the firmware may not know the APIC IDs The MADT table is not reasonable; the Local APIC entries (xAPIC/x2APIC) in the MADT are not always correct as the OS wants. So, we should do this sanity check. at the point MADT is built, so I think it is quite reasonable to put ~0 in there. The ACPI spec specifically calls out that the IDs must not change across sleep states, which implies to me that they may change across an offline period of a CPU. IOW I think such entries still need to contribute to the count of disabled CPUs. Aha, there are no CPU devices which will map their physical IDs to the illegal number 0xffffffff, so it will never be used. BTW, limiting the number of the disabled CPUs is also a goal. Thanks, dou I notice a similar change has been done for the xAPIC case a while ago by you, Thomas. Jan
[tip:x86/urgent] x86/vector: Remove the unused macro FPU_IRQ
Commit-ID: 7d878817db22f64c2b2c241335ec03e4c3fd5476 Gitweb: https://git.kernel.org/tip/7d878817db22f64c2b2c241335ec03e4c3fd5476 Author: Dou Liyang AuthorDate: Thu, 26 Apr 2018 14:08:32 +0800 Committer: Thomas Gleixner CommitDate: Thu, 26 Apr 2018 11:57:57 +0200 x86/vector: Remove the unused macro FPU_IRQ The macro FPU_IRQ has never been used since v3.10, So remove it. Signed-off-by: Dou Liyang Signed-off-by: Thomas Gleixner Cc: h...@zytor.com Link: https://lkml.kernel.org/r/20180426060832.27312-1-douly.f...@cn.fujitsu.com --- arch/x86/include/asm/irq_vectors.h | 2 -- 1 file changed, 2 deletions(-) diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h index 57003074bd7a..548d90bbf919 100644 --- a/arch/x86/include/asm/irq_vectors.h +++ b/arch/x86/include/asm/irq_vectors.h @@ -114,8 +114,6 @@ #define FIRST_SYSTEM_VECTORNR_VECTORS #endif -#define FPU_IRQ 13 - /* * Size the maximum number of interrupts. *
[PATCH] x86/vector: Remove the unused macro FPU_IRQ
The macro FPU_IRQ has never been used since v3.10, so remove it. Signed-off-by: Dou Liyang --- arch/x86/include/asm/irq_vectors.h | 2 -- 1 file changed, 2 deletions(-) diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h index 57003074bd7a..548d90bbf919 100644 --- a/arch/x86/include/asm/irq_vectors.h +++ b/arch/x86/include/asm/irq_vectors.h @@ -114,8 +114,6 @@ #define FIRST_SYSTEM_VECTORNR_VECTORS #endif -#define FPU_IRQ 13 - /* * Size the maximum number of interrupts. * -- 2.14.3
[PATCH] x86/idt: Simplify the idt_setup_apic_and_irq_gates()
The vectors between FIRST_SYSTEM_VECTOR and NR_VECTORS are special IRQ vectors used by the SMP architecture. But if X86_LOCAL_APIC=n, they will not be used, and FIRST_SYSTEM_VECTOR is equal to NR_VECTORS. idt_setup_apic_and_irq_gates() didn't notice that, which makes the code a little complex. Remove the code for the X86_LOCAL_APIC=n case to simplify it. Signed-off-by: Dou Liyang --- arch/x86/kernel/idt.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c index 2c3a1b4294eb..8b4174890706 100644 --- a/arch/x86/kernel/idt.c +++ b/arch/x86/kernel/idt.c @@ -317,15 +317,16 @@ void __init idt_setup_apic_and_irq_gates(void) set_intr_gate(i, entry); } - for_each_clear_bit_from(i, system_vectors, NR_VECTORS) { + /* +* If X86_LOCAL_APIC=n, the FIRST_SYSTEM_VECTOR is equal to NR_VECTORS +* Just consider the X86_LOCAL_APIC=y case +*/ #ifdef CONFIG_X86_LOCAL_APIC + for_each_clear_bit_from(i, system_vectors, NR_VECTORS) { set_bit(i, system_vectors); set_intr_gate(i, spurious_interrupt); -#else - entry = irq_entries_start + 8 * (i - FIRST_EXTERNAL_VECTOR); - set_intr_gate(i, entry); -#endif } +#endif } /** -- 2.14.3
[tip:x86/urgent] x86/vector: Remove the macro VECTOR_OFFSET_START
Commit-ID: e3072805c61167b85a30ceeef606620704db31f7 Gitweb: https://git.kernel.org/tip/e3072805c61167b85a30ceeef606620704db31f7 Author: Dou Liyang AuthorDate: Wed, 25 Apr 2018 10:05:53 +0800 Committer: Ingo Molnar CommitDate: Thu, 26 Apr 2018 07:31:17 +0200 x86/vector: Remove the macro VECTOR_OFFSET_START Now, Linux uses matrix allocator for vector assignment, the original assignment code which used VECTOR_OFFSET_START has been removed. So remove the stale macro as well. Fixes: commit 69cde0004a4b ("x86/vector: Use matrix allocator for vector assignment") Signed-off-by: Dou Liyang Signed-off-by: Thomas Gleixner Acked-by: David Rientjes Cc: h...@zytor.com Link: https://lkml.kernel.org/r/20180425020553.17210-1-douly.f...@cn.fujitsu.com Signed-off-by: Ingo Molnar --- arch/x86/include/asm/irq_vectors.h | 5 - 1 file changed, 5 deletions(-) diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h index 404c5fdff859..57003074bd7a 100644 --- a/arch/x86/include/asm/irq_vectors.h +++ b/arch/x86/include/asm/irq_vectors.h @@ -34,11 +34,6 @@ * (0x80 is the syscall vector, 0x30-0x3f are for ISA) */ #define FIRST_EXTERNAL_VECTOR 0x20 -/* - * We start allocating at 0x21 to spread out vectors evenly between - * priority levels. (0x80 is the syscall vector) - */ -#define VECTOR_OFFSET_START1 /* * Reserve the lowest usable vector (and hence lowest priority) 0x20 for
[tip:x86/urgent] x86/vector: Remove the macro VECTOR_OFFSET_START
Commit-ID: 5a626a8dfb58a64a39f4351e3962e7320191f189 Gitweb: https://git.kernel.org/tip/5a626a8dfb58a64a39f4351e3962e7320191f189 Author: Dou Liyang AuthorDate: Wed, 25 Apr 2018 10:05:53 +0800 Committer: Thomas Gleixner CommitDate: Wed, 25 Apr 2018 10:56:24 +0200 x86/vector: Remove the macro VECTOR_OFFSET_START Now, Linux uses matrix allocator for vector assignment, the original assignment code which used VECTOR_OFFSET_START has been removed. So remove the stale macro as well. Fixes: commit 69cde0004a4b ("x86/vector: Use matrix allocator for vector assignment") Signed-off-by: Dou Liyang Signed-off-by: Thomas Gleixner Acked-by: David Rientjes Cc: h...@zytor.com Link: https://lkml.kernel.org/r/20180425020553.17210-1-douly.f...@cn.fujitsu.com --- arch/x86/include/asm/irq_vectors.h | 5 - 1 file changed, 5 deletions(-) diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h index 404c5fdff859..57003074bd7a 100644 --- a/arch/x86/include/asm/irq_vectors.h +++ b/arch/x86/include/asm/irq_vectors.h @@ -34,11 +34,6 @@ * (0x80 is the syscall vector, 0x30-0x3f are for ISA) */ #define FIRST_EXTERNAL_VECTOR 0x20 -/* - * We start allocating at 0x21 to spread out vectors evenly between - * priority levels. (0x80 is the syscall vector) - */ -#define VECTOR_OFFSET_START1 /* * Reserve the lowest usable vector (and hence lowest priority) 0x20 for
[PATCH] x86/vector: Remove the macro VECTOR_OFFSET_START
Now, Linux uses matrix allocator for vector assignment, the original assignment code which used VECTOR_OFFSET_START has been removed. So remove the stale macro as well Signed-off-by: Dou Liyang --- arch/x86/include/asm/irq_vectors.h | 5 - 1 file changed, 5 deletions(-) diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h index 404c5fdff859..57003074bd7a 100644 --- a/arch/x86/include/asm/irq_vectors.h +++ b/arch/x86/include/asm/irq_vectors.h @@ -34,11 +34,6 @@ * (0x80 is the syscall vector, 0x30-0x3f are for ISA) */ #define FIRST_EXTERNAL_VECTOR 0x20 -/* - * We start allocating at 0x21 to spread out vectors evenly between - * priority levels. (0x80 is the syscall vector) - */ -#define VECTOR_OFFSET_START1 /* * Reserve the lowest usable vector (and hence lowest priority) 0x20 for -- 2.14.3
Re: [RESEND PATCH] x86/boot/KASLR: Extend movable_node option for KASLR
Hi Ingo, Any comments on this? Currently, when users want to support node hotplug with KASLR, they use 'mem=' to restrict the boot-up memory to the first node's memory size. If we want to boot up some hotpluggable node, its memory can't be shown. IMO, only a few machines can support physical NUMA node hotplug, and we can't get memory hotplug info from ACPI SRAT early enough now (if we could do that, we could even remove the 'movable_node' option). So, IMO, extend movable_node to replace the misuse of the 'mem' option. Thoughts? Thanks, dou At 04/03/2018 11:36 AM, Dou Liyang wrote: The movable_node option is a boot-time switch to make sure the physical NUMA nodes can be hot-added/removed when the ACPI table can't be parsed to provide the memory hotplug information. As we all know, there is always one node, called the "home node", which is not movable and in which the kernel image resides. With the movable_node option, Linux allocates new early memory near the kernel image to avoid using the other movable nodes. But because KASLR also can't get the memory hotplug information, it may randomize the kernel image into a movable node, which breaks the rule of the movable_node option and makes the physical hot-add/remove operation fail. The perfect solution is providing the memory hotplug information to KASLR, but that needs effort from both hardware engineers and software engineers. Here is an alternative method: extend the movable_node option with a parameter to restrict the kernel to being randomized within the home node. This parameter sets up the boundary between the home node and the other nodes. Reported-by: Chao Fan Signed-off-by: Dou Liyang Reviewed-by: Kees Cook --- Changelog: -Rewrite the commit log and document.
Documentation/admin-guide/kernel-parameters.txt | 12 ++-- arch/x86/boot/compressed/kaslr.c| 19 --- 2 files changed, 26 insertions(+), 5 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 1d1d53f85ddd..0cfc0b10a117 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2353,7 +2353,8 @@ mousedev.yres= [MOUSE] Vertical screen resolution, used for devices reporting absolute coordinates, such as tablets - movablecore=nn[KMG] [KNL,X86,IA-64,PPC] This parameter + movablecore=nn[KMG] + [KNL,X86,IA-64,PPC] This parameter is similar to kernelcore except it specifies the amount of memory used for migratable allocations. If both kernelcore and movablecore is specified, @@ -2363,12 +2364,19 @@ that the amount of memory usable for all allocations is not too small. - movable_node [KNL] Boot-time switch to make hotplugable memory + movable_node[KNL] Boot-time switch to make hot-pluggable memory NUMA nodes to be movable. This means that the memory of such nodes will be usable only for movable allocations which rules out almost all kernel allocations. Use with caution! + movable_node=nn[KMG] + [KNL] Extend movable_node to make it work well with KASLR. + This parameter is the boundaries between the "home node" and + the other nodes. The "home node" is an immovable node and is + defined by BIOS. Set the 'nn' to the memory size of "home + node", the kernel image will be extracted in immovable nodes. 
+ MTD_Partition= [MTD] Format: ,,, diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c index 8199a6187251..f906d7890e69 100644 --- a/arch/x86/boot/compressed/kaslr.c +++ b/arch/x86/boot/compressed/kaslr.c @@ -92,7 +92,10 @@ struct mem_vector { static bool memmap_too_large; -/* Store memory limit specified by "mem=nn[KMG]" or "memmap=nn[KMG]" */ +/* + * Store memory limit specified by the following situations: + * "mem=nn[KMG]" or "memmap=nn[KMG]" or "movable_node=nn[KMG]" + */ unsigned long long mem_limit = ULLONG_MAX; @@ -214,7 +217,8 @@ static int handle_mem_memmap(void) char *param, *val; u64 mem_size; - if (!strstr(args, "memmap=") && !strstr(args, "mem=")) + if (!strstr(args, "memmap=") && !strstr(args, "mem=") && + !strstr(args, "movable_node=")) return 0; tmp_cmdline = malloc(len + 1); @@ -249,7 +253,16 @@ static int handle_mem_memmap(void)
[tip:x86/urgent] x86/processor: Remove two unused function declarations
Commit-ID: 451cf3ca7d4615631443014ee769c25e267c25ff Gitweb: https://git.kernel.org/tip/451cf3ca7d4615631443014ee769c25e267c25ff Author: Dou Liyang AuthorDate: Wed, 4 Apr 2018 14:45:27 +0800 Committer: Thomas Gleixner CommitDate: Tue, 17 Apr 2018 11:56:32 +0200 x86/processor: Remove two unused function declarations early_trap_init() and cpu_set_gdt() have been removed, so remove the stale declarations as well. Signed-off-by: Dou Liyang Signed-off-by: Thomas Gleixner Cc: keesc...@chromium.org Cc: l...@kernel.org Cc: h...@zytor.com Cc: b...@suse.de Cc: kirill.shute...@linux.intel.com Link: https://lkml.kernel.org/r/20180404064527.10562-1-douly.f...@cn.fujitsu.com --- arch/x86/include/asm/processor.h | 2 -- 1 file changed, 2 deletions(-) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 4fa4206029e3..21a114914ba4 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -749,13 +749,11 @@ enum idle_boot_override {IDLE_NO_OVERRIDE=0, IDLE_HALT, IDLE_NOMWAIT, extern void enable_sep_cpu(void); extern int sysenter_setup(void); -extern void early_trap_init(void); void early_trap_pf_init(void); /* Defined in head.S */ extern struct desc_ptr early_gdt_descr; -extern void cpu_set_gdt(int); extern void switch_to_new_gdt(int); extern void load_direct_gdt(int); extern void load_fixmap_gdt(int);
[tip:x86/urgent] x86/acpi: Prevent X2APIC id 0xffffffff from being accounted
Commit-ID: 10daf10ab154e31237a8c07242be3063fb6a9bf4 Gitweb: https://git.kernel.org/tip/10daf10ab154e31237a8c07242be3063fb6a9bf4 Author: Dou Liyang AuthorDate: Thu, 12 Apr 2018 09:40:52 +0800 Committer: Thomas Gleixner CommitDate: Tue, 17 Apr 2018 11:56:31 +0200 x86/acpi: Prevent X2APIC id 0xffffffff from being accounted RongQing reported that there are some X2APIC id 0xffffffff in his machine's ACPI MADT table, which makes the number of possible CPUs inaccurate. The reason is that the ACPI X2APIC parser has no sanity check for APIC ID 0xffffffff, which is an invalid id in all APIC types. See "Intel® 64 Architecture x2APIC Specification", Chapter 2.4.1. Add a sanity check to acpi_parse_x2apic() which ignores the invalid id. Reported-by: Li RongQing Signed-off-by: Dou Liyang Signed-off-by: Thomas Gleixner Cc: sta...@vger.kernel.org Cc: len.br...@intel.com Cc: r...@rjwysocki.net Cc: h...@zytor.com Link: https://lkml.kernel.org/r/20180412014052.25186-1-douly.f...@cn.fujitsu.com --- arch/x86/kernel/acpi/boot.c | 4 1 file changed, 4 insertions(+) diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index dde444f932c1..3b20607d581b 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -215,6 +215,10 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end) apic_id = processor->local_apic_id; enabled = processor->lapic_flags & ACPI_MADT_ENABLED; + /* Ignore invalid ID */ + if (apic_id == 0xffffffff) + return 0; + /* * We need to register disabled CPU as well to permit * counting disabled CPUs. This allows us to size
[PATCH] x86/acpi: Prevent X2APIC id 0xffffffff from being accounted
RongQing reported that there are some X2APIC id 0xffffffff in his machine's ACPI MADT table, which made the number of possible CPUs inaccurate. The reason is that the ACPI X2APIC parser has no sanity check for apicid 0xffffffff, which is an invalid id in all APIC types. See "Intel® 64 Architecture x2APIC Specification", Chapter 2.4.1. Add a sanity check to acpi_parse_x2apic() which ignores the invalid id. Reported-by: Li RongQing Signed-off-by: Dou Liyang --- arch/x86/kernel/acpi/boot.c | 4 1 file changed, 4 insertions(+) diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index dde444f932c1..3b20607d581b 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -215,6 +215,10 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end) apic_id = processor->local_apic_id; enabled = processor->lapic_flags & ACPI_MADT_ENABLED; + /* Ignore invalid ID */ + if (apic_id == 0xffffffff) + return 0; + /* * We need to register disabled CPU as well to permit * counting disabled CPUs. This allows us to size -- 2.14.3
Re: [tip:x86/urgent] x86/apic: Fix signedness bug in APIC ID validity checks
Hi Thomas, At 04/10/2018 10:51 PM, tip-bot for Li RongQing wrote: [...] x86/apic: Fix signedness bug in APIC ID validity checks The APIC ID as parsed from ACPI MADT is validity checked with the apic->apic_id_valid() callback, which depends on the selected APIC type. For non X2APIC types APIC IDs >= 0xFF are invalid, but values > 0x7FFFFFFF Today when I was reading the "Intel® 64 Architecture x2APIC Specification", I found the following in chapter 2.4.1: The APIC ID value of FFFF_FFFFH and the highest value corresponding to the implemented bit-width of the local APIC ID register in the system are reserved and cannot be assigned to any logical processor. It seems FFFF_FFFFH is also invalid for X2APIC types. Shall we also do the validity check for the X2APIC ID? acpi_parse_x2apic() ... /* Ignore invalid ID */ if (apic_id == 0xffffffff) return 0; ... Thanks, dou
Re: Re: Re: [RFC PATCH] x86/acpi: Prevent x2apic id -1 from being accounted
Hi Rongqing, At 04/09/2018 05:33 PM, Li,Rongqing wrote: [...] Thanks. How about the below modification: It looks good to me. You need to start a new v2 thread for this, so that the maintainers can easily get it. commit 96ba42cf87ce0e62d54c01bfa9a9479b2e87 Author: Li RongQing Date: Sun Apr 8 18:54:10 2018 +0800 x86/acpi: Prevent all the X2APIC Id from being parsed in non-x2apic mode The values of the x2APIC ID are greater than 0xff in the ACPI MADT. If the APIC is apic_flat, default_apic_id_valid() is called to check the id, which is converted from u32 to int, and will return true if the id is larger than 0x7fffffff; this is wrong. And if local_apic_id is invalid, we should prevent it from being accounted. This fixes a bug where a Purley platform displays too many possible CPUs. Signed-off-by: Li RongQing Suggested-by:: Dou Liyang ^---^ no need to add this, it is my pleasure Thanks, dou Cc: Peter Zijlstra Cc: Thomas Gleixner diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h index 40a3d3642f3a..08acd954f00e 100644 --- a/arch/x86/include/asm/apic.h +++ b/arch/x86/include/asm/apic.h @@ -313,7 +313,7 @@ struct apic { /* Probe, setup and smpboot functions */ int (*probe)(void); int (*acpi_madt_oem_check)(char *oem_id, char *oem_table_id); - int (*apic_id_valid)(int apicid); + int (*apic_id_valid)(u32 apicid); int (*apic_id_registered)(void); bool(*check_apicid_used)(physid_mask_t *map, int apicid); @@ -486,7 +486,7 @@ static inline unsigned int read_apic_id(void) return apic->get_apic_id(reg); } -extern int default_apic_id_valid(int apicid); +extern int default_apic_id_valid(u32 apicid); extern int default_acpi_madt_oem_check(char *, char *); extern void default_setup_apic_routing(void); diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index 7a37d9357bc4..4ba949de1ca9 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -200,7 +200,7 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end) { struct acpi_madt_local_x2apic
*processor = NULL; #ifdef CONFIG_X86_X2APIC - int apic_id; + u32 apic_id; u8 enabled; #endif @@ -222,10 +222,13 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end) * to not preallocating memory for all NR_CPUS * when we use CPU hotplug. */ - if (!apic->apic_id_valid(apic_id) && enabled) - printk(KERN_WARNING PREFIX "x2apic entry ignored\n"); - else - acpi_register_lapic(apic_id, processor->uid, enabled); + if (!apic->apic_id_valid(apic_id)) { + if (enabled) + printk(KERN_WARNING PREFIX "x2apic entry ignored\n"); + return 0; + } + + acpi_register_lapic(apic_id, processor->uid, enabled); #else printk(KERN_WARNING PREFIX "x2apic entry ignored\n"); #endif diff --git a/arch/x86/kernel/apic/apic_common.c b/arch/x86/kernel/apic/apic_common.c index a360801779ae..02b4839478b1 100644 --- a/arch/x86/kernel/apic/apic_common.c +++ b/arch/x86/kernel/apic/apic_common.c @@ -40,7 +40,7 @@ int default_check_phys_apicid_present(int phys_apicid) return physid_isset(phys_apicid, phys_cpu_present_map); } -int default_apic_id_valid(int apicid) +int default_apic_id_valid(u32 apicid) { return (apicid < 255); } diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c index 134e04506ab4..78778b54f904 100644 --- a/arch/x86/kernel/apic/apic_numachip.c +++ b/arch/x86/kernel/apic/apic_numachip.c @@ -56,7 +56,7 @@ static u32 numachip2_set_apic_id(unsigned int id) return id << 24; } -static int numachip_apic_id_valid(int apicid) +static int numachip_apic_id_valid(u32 apicid) { /* Trust what bootloader passes in MADT */ return 1; diff --git a/arch/x86/kernel/apic/x2apic.h b/arch/x86/kernel/apic/x2apic.h index b107de381cb5..a49b3604027f 100644 --- a/arch/x86/kernel/apic/x2apic.h +++ b/arch/x86/kernel/apic/x2apic.h @@ -1,6 +1,6 @@ /* Common bits for X2APIC cluster/physical modes. 
*/ -int x2apic_apic_id_valid(int apicid); +int x2apic_apic_id_valid(u32 apicid); int x2apic_apic_id_registered(void); void __x2apic_send_IPI_dest(unsigned int apicid, int vector, unsigned int dest); unsigned int x2apic_get_apic_id(unsigned long id); diff --git a/arch/x86/kernel/apic/x2apic_phys.c b/arch/x86/kernel/apic/x2apic_phys.c index e2829bf40e4a..b5cf9e7b3830 100644 --- a/arch/x86/kernel/apic/x2apic_phys.c +++ b/arch/x86/kernel/apic/x2apic_phys.c @@ -101,7 +101,7 @@ static int x2apic_phys_probe(void) } /* Common x2apic functions, also used by
Re: Reply: [RFC PATCH] x86/acpi: Prevent x2apic id -1 from being accounted
RongQing, At 04/09/2018 02:38 PM, Li,Rongqing wrote: -----Original Message----- From: Dou Liyang [mailto:douly.f...@cn.fujitsu.com] Sent: April 9, 2018 13:38 To: Li,Rongqing ; linux-kernel@vger.kernel.org; t...@linutronix.de; mi...@redhat.com; h...@zytor.com; jgr...@suse.com; x...@kernel.org; pet...@infradead.org Subject: Re: [RFC PATCH] x86/acpi: Prevent x2apic id -1 from being accounted Hi RongQing, Is there a local x2apic whose ID is in your machine? I think no [...] [0.00] ACPI: X2APIC (apic_id[0x] uid[0x00] disabled) [0.00] ACPI: X2APIC (apic_id[0x] uid[0x01] disabled) [0.00] ACPI: X2APIC (apic_id[0x] uid[0x02] disabled) Ah, sure enough! [...] At 04/08/2018 07:38 PM, Li RongQing wrote: local_apic_id of acpi_madt_local_x2apic is u32; it is converted to int when checked by default_apic_id_valid() and the check returns true if it is larger than 0x7fff, which is wrong. For x2apic enabled systems, - the byte length of the X2APIC ID is 4, and it can be larger than 0x7fff in theory Yes - the ->apic_id_valid points to x2apic_apic_id_valid(), which always returns _true_, not default_apic_id_valid(). On this machine, the apic is apic_flat I see, I am sorry the title and changelog made me misunderstand. Here, actually, we prevent all the X2APIC IDs from being parsed in non-x2apic mode, not just 0x., because the values of the x2APIC ID must be 255 or greater in ACPI MADT.
-RongQing Thanks, dou and if local_apic_id is invalid, we should prevent it from being accounted > This fixes a bug that Purley platform displays too many possible cpu Signed-off-by: Li RongQing Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Dou Liyang --- arch/x86/include/asm/apic.h | 4 ++-- arch/x86/kernel/acpi/boot.c | 10 ++ arch/x86/kernel/apic/apic_common.c | 2 +- arch/x86/kernel/apic/apic_numachip.c | 2 +- arch/x86/kernel/apic/x2apic.h| 2 +- arch/x86/kernel/apic/x2apic_phys.c | 2 +- arch/x86/kernel/apic/x2apic_uv_x.c | 2 +- arch/x86/xen/apic.c | 2 +- 8 files changed, 14 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h index 40a3d3642f3a..08acd954f00e 100644 --- a/arch/x86/include/asm/apic.h +++ b/arch/x86/include/asm/apic.h @@ -313,7 +313,7 @@ struct apic { /* Probe, setup and smpboot functions */ int (*probe)(void); int (*acpi_madt_oem_check)(char *oem_id, char *oem_table_id); - int (*apic_id_valid)(int apicid); + int (*apic_id_valid)(u32 apicid); int (*apic_id_registered)(void); bool(*check_apicid_used)(physid_mask_t *map, int apicid); @@ -486,7 +486,7 @@ static inline unsigned int read_apic_id(void) return apic->get_apic_id(reg); } -extern int default_apic_id_valid(int apicid); +extern int default_apic_id_valid(u32 apicid); extern int default_acpi_madt_oem_check(char *, char *); extern void default_setup_apic_routing(void); diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index 7a37d9357bc4..7412564dc2a7 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -200,7 +200,7 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end) { struct acpi_madt_local_x2apic *processor = NULL; #ifdef CONFIG_X86_X2APIC - int apic_id; + u32 apic_id; u8 enabled; #endif @@ -222,10 +222,12 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end) * to not preallocating memory for all NR_CPUS * when we use CPU hotplug. 
*/ - if (!apic->apic_id_valid(apic_id) && enabled) + if (!apic->apic_id_valid(apic_id)) { printk(KERN_WARNING PREFIX "x2apic entry ignored\n"); Because the disabled APIC entries may not exist physically, please only printk the warning if the APIC entry is enabled. if (!apic->apic_id_valid(apic_id)) { if (enabled) printk... return 0; } Thanks, dou
Re: [PATCH] genirq: only scan the present CPUs
Hi Peter, At 04/06/2018 05:05 PM, Peter Zijlstra wrote: On Fri, Apr 06, 2018 at 11:02:28AM +0200, Peter Zijlstra wrote: On Fri, Apr 06, 2018 at 04:42:14PM +0800, Dou Liyang wrote: Hi Thomas, Peter, At 04/03/2018 07:23 PM, Peter Zijlstra wrote: On Tue, Apr 03, 2018 at 12:25:56PM +0200, Thomas Gleixner wrote: On Mon, 2 Apr 2018, Li RongQing wrote: Lots of applications read /proc/stat, like ps and vmstat, but we find the reading time spreading on the Purley platform, which has lots of possible CPUs and interrupts. To reduce the reading time, only scan the present CPUs, not all possible CPUs, which speeds up reading /proc/stat 20 times on a Purley platform that has 56 present CPUs and 224 possible CPUs. Why is BIOS/ACPI telling the kernel that there are 224 possible CPUs unless it supports physical CPU hotplug. BIOS is crap, news at 11. I've got boxes like that too. Use possible_cpu=$nr if you're bothered by it -- it's what I do. Yes, I think so. It is a manual way to reset the number. For this situation, I am investigating restricting the number of possible CPUs automatically. But, due to the limitation of the ACPI subsystem, I can do it _before_ setup_percpu_area where the number will be used. Ah, did you mean to say "I can _NOT_ do it" ? Still I don't see the ^--- Oops, yes. point of frobbing random users if the whole thing is buggered. If the ACPI subsystem can be initialized earlier, we can get the accurate number of possible CPUs from the ACPI namespace. Then, we can reset the _cpu_possible_mask_ as prefill_possible_map() does, so it can avoid frobbing random users. But it needs the memory to be initialized first, so it can't be called earlier than setup_percpu_area(), which is invoked earlier than mem_init(). And you are right: "So if you see it enumerates a gazillion empty spots but the system does not in fact support physical hotplug, we should discard those." I will think about it more carefully. Thanks, dou
Re: [RFC PATCH] x86/acpi: Prevent x2apic id -1 from being accounted
Hi RongQing, Is there a local x2apic whose ID is 0x in your machine? At 04/08/2018 07:38 PM, Li RongQing wrote: local_apic_id of acpi_madt_local_x2apic is u32; it is converted to int when checked by default_apic_id_valid() and the check returns true if it is larger than 0x7fff, which is wrong. For x2apic enabled systems, - the byte length of the X2APIC ID is 4, and it can be larger than 0x7fff in theory - the ->apic_id_valid points to x2apic_apic_id_valid(), which always returns _true_, not default_apic_id_valid(). Thanks, dou and if the local_apic_id is invalid, we should prevent it from being accounted. This fixes a bug where the Purley platform displays too many possible CPUs. Signed-off-by: Li RongQing Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Dou Liyang --- arch/x86/include/asm/apic.h | 4 ++-- arch/x86/kernel/acpi/boot.c | 10 ++ arch/x86/kernel/apic/apic_common.c | 2 +- arch/x86/kernel/apic/apic_numachip.c | 2 +- arch/x86/kernel/apic/x2apic.h| 2 +- arch/x86/kernel/apic/x2apic_phys.c | 2 +- arch/x86/kernel/apic/x2apic_uv_x.c | 2 +- arch/x86/xen/apic.c | 2 +- 8 files changed, 14 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h index 40a3d3642f3a..08acd954f00e 100644 --- a/arch/x86/include/asm/apic.h +++ b/arch/x86/include/asm/apic.h @@ -313,7 +313,7 @@ struct apic { /* Probe, setup and smpboot functions */ int (*probe)(void); int (*acpi_madt_oem_check)(char *oem_id, char *oem_table_id); - int (*apic_id_valid)(int apicid); + int (*apic_id_valid)(u32 apicid); int (*apic_id_registered)(void); bool (*check_apicid_used)(physid_mask_t *map, int apicid); @@ -486,7 +486,7 @@ static inline unsigned int read_apic_id(void) return apic->get_apic_id(reg); } -extern int default_apic_id_valid(int apicid); +extern int default_apic_id_valid(u32 apicid); extern int default_acpi_madt_oem_check(char *, char *); extern void default_setup_apic_routing(void); diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index
7a37d9357bc4..7412564dc2a7 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -200,7 +200,7 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end) { struct acpi_madt_local_x2apic *processor = NULL; #ifdef CONFIG_X86_X2APIC - int apic_id; + u32 apic_id; u8 enabled; #endif @@ -222,10 +222,12 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end) * to not preallocating memory for all NR_CPUS * when we use CPU hotplug. */ - if (!apic->apic_id_valid(apic_id) && enabled) + if (!apic->apic_id_valid(apic_id)) { printk(KERN_WARNING PREFIX "x2apic entry ignored\n"); - else - acpi_register_lapic(apic_id, processor->uid, enabled); + return 0; + } + + acpi_register_lapic(apic_id, processor->uid, enabled); #else printk(KERN_WARNING PREFIX "x2apic entry ignored\n"); #endif diff --git a/arch/x86/kernel/apic/apic_common.c b/arch/x86/kernel/apic/apic_common.c index a360801779ae..02b4839478b1 100644 --- a/arch/x86/kernel/apic/apic_common.c +++ b/arch/x86/kernel/apic/apic_common.c @@ -40,7 +40,7 @@ int default_check_phys_apicid_present(int phys_apicid) return physid_isset(phys_apicid, phys_cpu_present_map); } -int default_apic_id_valid(int apicid) +int default_apic_id_valid(u32 apicid) { return (apicid < 255); } diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c index 134e04506ab4..78778b54f904 100644 --- a/arch/x86/kernel/apic/apic_numachip.c +++ b/arch/x86/kernel/apic/apic_numachip.c @@ -56,7 +56,7 @@ static u32 numachip2_set_apic_id(unsigned int id) return id << 24; } -static int numachip_apic_id_valid(int apicid) +static int numachip_apic_id_valid(u32 apicid) { /* Trust what bootloader passes in MADT */ return 1; diff --git a/arch/x86/kernel/apic/x2apic.h b/arch/x86/kernel/apic/x2apic.h index b107de381cb5..a49b3604027f 100644 --- a/arch/x86/kernel/apic/x2apic.h +++ b/arch/x86/kernel/apic/x2apic.h @@ -1,6 +1,6 @@ /* Common bits for X2APIC cluster/physical modes. 
*/ -int x2apic_apic_id_valid(int apicid); +int x2apic_apic_id_valid(u32 apicid); int x2apic_apic_id_registered(void); void __x2apic_send_IPI_dest(unsigned int apicid, int vector, unsigned int dest); unsigned int x2apic_get_apic_id(unsigned long id); diff --git a/arch/x86/kernel/apic/x2apic_phys.c b/arch/x86/kernel/apic/x2apic_phys.c index e2829bf40e4a..b5cf9e7b3830 100644 --- a/arch/x86/kernel/apic/x2apic_phys.c +++ b/arch/x86/kernel/apic/x2apic_phys.c @@ -101,7 +101,7 @@ static int x2apic_phys_
Re: [PATCH] genirq: only scan the present CPUs
Hi Thomas, Peter, At 04/03/2018 07:23 PM, Peter Zijlstra wrote: On Tue, Apr 03, 2018 at 12:25:56PM +0200, Thomas Gleixner wrote: On Mon, 2 Apr 2018, Li RongQing wrote: Lots of applications read /proc/stat, like ps and vmstat, but we find the reading time spreading on the Purley platform, which has lots of possible CPUs and interrupts. To reduce the reading time, only scan the present CPUs, not all possible CPUs, which speeds up reading /proc/stat 20 times on a Purley platform that has 56 present CPUs and 224 possible CPUs. Why is BIOS/ACPI telling the kernel that there are 224 possible CPUs unless it supports physical CPU hotplug. BIOS is crap, news at 11. I've got boxes like that too. Use possible_cpu=$nr if you're bothered by it -- it's what I do. Yes, I think so. It is a manual way to reset the number. For this situation, I am investigating restricting the number of possible CPUs automatically. But, due to the limitation of the ACPI subsystem, I can do it _before_ setup_percpu_area where the number will be used. But I can provide an indicator to tell the system whether physical CPU hotplug is supported or not later. Can we use this indicator in this situation like: if true, use for_each_possible_cpu(cpu); else, use for_each_present_cpu(cpu)? Thanks, dou
[PATCH] x86/processor: Remove two legacy function declarations
The early_trap_init() and cpu_set_gdt() functions have been abandoned, so remove their declarations. Signed-off-by: Dou Liyang --- arch/x86/include/asm/processor.h | 2 -- 1 file changed, 2 deletions(-) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index b0ccd4847a58..3ef3221107c3 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -739,13 +739,11 @@ enum idle_boot_override {IDLE_NO_OVERRIDE=0, IDLE_HALT, IDLE_NOMWAIT, extern void enable_sep_cpu(void); extern int sysenter_setup(void); -extern void early_trap_init(void); void early_trap_pf_init(void); /* Defined in head.S */ extern struct desc_ptr early_gdt_descr; -extern void cpu_set_gdt(int); extern void switch_to_new_gdt(int); extern void load_direct_gdt(int); extern void load_fixmap_gdt(int); -- 2.14.3
[RESEND PATCH] x86/boot/KASLR: Extend movable_node option for KASLR
The movable_node option is a boot-time switch to make sure the physical NUMA nodes can be hot-added/removed when the ACPI table can't be parsed to provide the memory hotplug information. As we all know, there is always one node, called the "home node", which can't be movable and in which the kernel image resides. With the movable_node option, Linux allocates early memory near the kernel image to avoid using the other, movable nodes. But because KASLR also can't get the memory hotplug information, it may randomize the kernel image into a movable node, which breaks the rule of the movable_node option and makes the physical hot-add/remove operation fail. The perfect solution is providing the memory hotplug information to KASLR, but it needs effort from both hardware engineers and software engineers. Here is an alternative method: extend the movable_node option with a parameter that restricts the kernel to being randomized within the home node. This parameter sets the boundary between the home node and the other nodes. Reported-by: Chao Fan Signed-off-by: Dou Liyang Reviewed-by: Kees Cook --- Changelog: - Rewrite the commit log and document. Documentation/admin-guide/kernel-parameters.txt | 12 ++-- arch/x86/boot/compressed/kaslr.c| 19 --- 2 files changed, 26 insertions(+), 5 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 1d1d53f85ddd..0cfc0b10a117 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2353,7 +2353,8 @@ mousedev.yres= [MOUSE] Vertical screen resolution, used for devices reporting absolute coordinates, such as tablets - movablecore=nn[KMG] [KNL,X86,IA-64,PPC] This parameter + movablecore=nn[KMG] + [KNL,X86,IA-64,PPC] This parameter is similar to kernelcore except it specifies the amount of memory used for migratable allocations.
If both kernelcore and movablecore is specified, @@ -2363,12 +2364,19 @@ that the amount of memory usable for all allocations is not too small. - movable_node[KNL] Boot-time switch to make hotplugable memory + movable_node[KNL] Boot-time switch to make hot-pluggable memory NUMA nodes to be movable. This means that the memory of such nodes will be usable only for movable allocations which rules out almost all kernel allocations. Use with caution! + movable_node=nn[KMG] + [KNL] Extend movable_node to make it work well with KASLR. + This parameter is the boundary between the "home node" and + the other nodes. The "home node" is an immovable node and is + defined by BIOS. Set the 'nn' to the memory size of "home + node", the kernel image will be extracted in immovable nodes. + MTD_Partition= [MTD] Format: ,,, diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c index 8199a6187251..f906d7890e69 100644 --- a/arch/x86/boot/compressed/kaslr.c +++ b/arch/x86/boot/compressed/kaslr.c @@ -92,7 +92,10 @@ struct mem_vector { static bool memmap_too_large; -/* Store memory limit specified by "mem=nn[KMG]" or "memmap=nn[KMG]" */ +/* + * Store memory limit specified by the following situations: + * "mem=nn[KMG]" or "memmap=nn[KMG]" or "movable_node=nn[KMG]" + */ unsigned long long mem_limit = ULLONG_MAX; @@ -214,7 +217,8 @@ static int handle_mem_memmap(void) char *param, *val; u64 mem_size; - if (!strstr(args, "memmap=") && !strstr(args, "mem=")) + if (!strstr(args, "memmap=") && !strstr(args, "mem=") && + !strstr(args, "movable_node=")) return 0; tmp_cmdline = malloc(len + 1); @@ -249,7 +253,16 @@ static int handle_mem_memmap(void) free(tmp_cmdline); return -EINVAL; } - mem_limit = mem_size; + mem_limit = mem_limit > mem_size ? mem_size : mem_limit; + } else if (!strcmp(param, "movable_node")) { + char *p = val; + + mem_size = memparse(p, &p); + if (mem_size == 0) { + free(tmp_cmdline); + return -EINVAL; + } + mem_limit = mem_limit > mem_size ?
mem_size : mem_limit; } } -- 2.14.3
Re: [PATCH v9 0/5] x86/KASLR: Add parameter kaslr_boot_mem=nn[KMG]@ss[KMG]
Hi Ingo, Kees, Baoquan and Chao At 03/12/2018 06:57 PM, Ingo Molnar wrote: [...] So there's apparently a mis-design here: - KASLR needs to be done very early on during bootup: - it's not realistic to expect KASLR to be done with a booted up kernel, because pointers to various KASLR-ed objects are already widely spread out in memory. - But for some unfathomable reason the memory hotplug attribute of memory regions is not part of the regular memory map but part of late-init ACPI data structures. The right solution would be _not_ to fudge the KASLR location, but to provide the memory hotplug information to early code, preferably via the primary memory map. KASLR can then make use of it and avoid those regions, just like it avoids other memory regions already. In addition to that hardware makers (including virtualized hardware) should also fix their systems to provide memory hotplug information to early code. Yes, but before this, can we fix this problem with the following patch, which has been sent and reviewed by Kees before[1]? Its solution is: extend the movable_node option with a parameter that restricts the kernel to being randomized in immovable nodes. This parameter sets the boundary between the home node and the other nodes. My reasons are: - What we really want to solve is that KASLR breaks *physical node hotplug*; keeping the decompressed kernel in an immovable node is enough. - AFAICS, there are not too many systems where physical node hotplug actually works in practice, and there must be one node, called the *home node*, which is immovable and stores basic information. - A node in modern systems can have double-digit gigabytes of memory, which completely satisfies the needs of KASLR. So, just restrict the kernel to being randomized in the home node, and ignore the other nodes, when the kernel has the *movable_node* option on the command line. Thoughts? May I rebase and resend the patch? [1] https://lkml.org/lkml/2017/8/3/401 Thanks, dou
Re: [PATCH 1/5] x86/smpboot: Add the missing description of possible_cpus
At 03/21/2018 05:34 PM, Dou Liyang wrote: Hi Peter, At 03/21/2018 05:08 PM, Peter Zijlstra wrote: On Wed, Mar 21, 2018 at 01:33:24PM +0800, Dou Liyang wrote: How about: possible_cpus= [s390,x86_64] Set the number of possible CPUs, which is determined by the ACPI tables MADT or mptables by default. possible_cpus=n : n >= 1 enforces the possible number to be 'n'. When nr_cpus is also set (nr_cpus=m), choose the minimum of the two for the number of possible CPUs. So what is the exact difference between possible_cpus and nr_cpus ? I know maxcpus= limits the number of CPUs we bring up, and possible_cpus limits the possible_map, but I'm not entirely sure what nr_cpus does here. The possible_cpus= can reset the number of possible CPUs, even bigger than 'num_processors + disabled_cpus' ^^^ (the maximum number the kernel gets from ACPI/mptables), but nr_cpus= can't: no matter what number nr_cpus= is set to, the number of possible CPUs will not be bigger than it. nr_cpus can limit the maximum number of CPUs that the kernel could support. Here is a double check in case both are used at the same time, even if I think just using possible_cpus= is enough. :-) Thanks, dou.
Re: [PATCH 1/5] x86/smpboot: Add the missing description of possible_cpus
Hi Peter, At 03/21/2018 05:08 PM, Peter Zijlstra wrote: On Wed, Mar 21, 2018 at 01:33:24PM +0800, Dou Liyang wrote: How about: possible_cpus= [s390,x86_64] Set the number of possible CPUs, which is determined by the ACPI tables MADT or mptables by default. possible_cpus=n : n >= 1 enforces the possible number to be 'n'. When nr_cpus is also set (nr_cpus=m), choose the minimum of the two for the number of possible CPUs. So what is the exact difference between possible_cpus and nr_cpus ? I know maxcpus= limits the number of CPUs we bring up, and possible_cpus limits the possible_map, but I'm not entirely sure what nr_cpus does here. The possible_cpus= can reset the number of possible CPUs, even bigger than 'num_processors + disabled_cpus', but nr_cpus= can't. nr_cpus can limit the maximum number of CPUs that the kernel could support. Here is a double check in case both are used at the same time, even if I think just using possible_cpus= is enough. :-) Thanks, dou.
Re: [PATCH 3/5] x86/smpboot: Make the check code more clear in prefill_possible_map()
Hi Peter, At 03/20/2018 08:39 PM, Peter Zijlstra wrote: On Tue, Mar 20, 2018 at 07:04:30PM +0800, Dou Liyang wrote: case 1: no | no | no | --> min (setup_possible_cpus, nr_cpu_ids, setup_max_cpus) case 2: no | no | yes| --> min (setup_possible_cpus, nr_cpu_ids) case 3: no | yes | no | --> 1 case 4: no | yes | yes| --> 1 case 5: yes | no | no | --> min (num_processors, nr_cpu_ids, setup_max_cpus) case 6: yes | no | yes| --> min (num_processors + disabled_cpus, nr_cpu_ids) case 7: yes | yes | no | --> 1 case 8: yes | yes | yes| --> 1 The case number is off by one ;-) I got it! ;-) Thanks dou
Re: [PATCH 1/5] x86/smpboot: Add the missing description of possible_cpus
Hi Peter, At 03/20/2018 08:37 PM, Peter Zijlstra wrote: On Tue, Mar 20, 2018 at 07:04:28PM +0800, Dou Liyang wrote: + possible_cpus= [s390,x86_64] Use this to set hotpluggable cpus. + This option sets possible_cpus bits in cpu_possible_map. + Thus keeping the numbers of bits set constant even if + the machine gets rebooted. That description, esp. the last sentence, doesn't make any kind of sense to me. What? Ah, sure enough, I can't be lazy. :-) I stole that from commit 3b11ce7f542e ("x86: use possible_cpus=NUM to extend the possible cpus allowed") How about: possible_cpus= [s390,x86_64] Set the number of possible CPUs, which is determined by the ACPI tables MADT or mptables by default. possible_cpus=n : n >= 1 enforces the possible number to be 'n'. When nr_cpus is also set (nr_cpus=m), choose the minimum of the two for the number of possible CPUs. Thanks, dou
[PATCH 3/5] x86/smpboot: Make the check code more clear in prefill_possible_map()
In prefill_possible_map(), the kernel needs to get the number of possible CPUs to reset cpu_possible_map. The number is related to the options -nosmp, maxcpus, possible_cpus, nr_cpus ... which need to be checked. Currently, the check code mixes these options together confusingly and is hard to follow. Isolate them and check them linearly. No functional change; this prepares for cutting the number of possible CPUs. Signed-off-by: Dou Liyang --- Situations: setup_possible_cpus == -1 | setup_max_cpus == 0 | CONFIG_HOTPLUG_CPU == y | yes | yes | yes | no | no | no | Test cases: Cases | the number of possible cpus | (the same with or w/o this patch) case 1: no | no | no | --> min (setup_possible_cpus, nr_cpu_ids, setup_max_cpus) case 2: no | no | yes | --> min (setup_possible_cpus, nr_cpu_ids) case 3: no | yes | no | --> 1 case 4: no | yes | yes | --> 1 case 5: yes | no | no | --> min (num_processors, nr_cpu_ids, setup_max_cpus) case 6: yes | no | yes | --> min (num_processors + disabled_cpus, nr_cpu_ids) case 7: yes | yes | no | --> 1 case 8: yes | yes | yes | --> 1 arch/x86/kernel/smpboot.c | 31 ++- 1 file changed, 18 insertions(+), 13 deletions(-) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index ff99e2b6fc54..2fdda8567bf9 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1334,7 +1334,7 @@ early_param("possible_cpus", _setup_possible_cpus); * We do this because additional CPUs waste a lot of memory.
* -AK */ -__init void prefill_possible_map(void) +void __init prefill_possible_map(void) { int i, possible; @@ -1356,18 +1356,22 @@ __init void prefill_possible_map(void) num_processors = 1; } - i = setup_max_cpus ?: 1; + /* The SMP kernel can act as a UP kernel via nosmp/maxcpus=0 */ + if (!setup_max_cpus) { + possible = 1; + total_cpus = num_processors + disabled_cpus; + goto set_cpu_possible_mask; + } + + /* Possible CPUs could be overwritten via possible_cpus= */ if (setup_possible_cpus == -1) { possible = num_processors; #ifdef CONFIG_HOTPLUG_CPU - if (setup_max_cpus) - possible += disabled_cpus; -#else - if (possible > i) - possible = i; + possible += disabled_cpus; #endif - } else + } else { possible = setup_possible_cpus; + } total_cpus = max_t(int, possible, num_processors + disabled_cpus); @@ -1378,15 +1382,16 @@ __init void prefill_possible_map(void) possible = nr_cpu_ids; } -#ifdef CONFIG_HOTPLUG_CPU - if (!setup_max_cpus) -#endif - if (possible > i) { +#ifndef CONFIG_HOTPLUG_CPU + /* Possible CPUs could be reduced via max_cpus= */ + if (possible > setup_max_cpus) { pr_warn("%d Processors exceeds max_cpus limit of %u\n", possible, setup_max_cpus); - possible = i; + possible = setup_max_cpus; } +#endif +set_cpu_possible_mask: nr_cpu_ids = possible; pr_info("Allowing %d CPUs, %d hotplug CPUs\n", -- 2.14.3
[PATCH 2/5] x86/cpu_hotplug: Update the link of cpu_hotplug.rst
The original cpu_hotplug.txt document described CPU hotplug support in the Linux kernel. It was moved into core-api/ and renamed cpu_hotplug.rst. Update its links in other documents. Fixes: ff58fa7f556c ("Documentation: Update CPU hotplug and move it to core-api") Signed-off-by: Dou Liyang --- Documentation/00-INDEX| 2 -- Documentation/cputopology.txt | 10 +- 2 files changed, 5 insertions(+), 7 deletions(-) diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX index 7f3a0728ccf2..3773c67ea9e5 100644 --- a/Documentation/00-INDEX +++ b/Documentation/00-INDEX @@ -104,8 +104,6 @@ core-api/ - documentation on kernel core components. cpu-freq/ - info on CPU frequency and voltage scaling. -cpu-hotplug.txt - - document describing CPU hotplug support in the Linux kernel. cpu-load.txt - document describing how CPU load statistics are collected. cpuidle/ diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt index c6e7e9196a8b..e05b0879fe91 100644 --- a/Documentation/cputopology.txt +++ b/Documentation/cputopology.txt @@ -117,9 +117,9 @@ source for the output is in brackets ("[]"). [NR_CPUS-1] offline: CPUs that are not online because they have been - HOTPLUGGED off (see cpu-hotplug.txt) or exceed the limit - of CPUs allowed by the kernel configuration (kernel_max - above). [~cpu_online_mask + cpus >= NR_CPUS] + HOTPLUGGED off (see core-api/cpu_hotplug.rst) or exceed + the limit of CPUs allowed by the kernel configuration + (kernel_max above). [~cpu_online_mask + cpus >= NR_CPUS] online:CPUs that are online and being scheduled [cpu_online_mask] @@ -155,5 +155,5 @@ online.):: possible: 0-127 present: 0-3 -See cpu-hotplug.txt for the possible_cpus=NUM kernel start parameter -as well as more information on the various cpumasks. +See core-api/cpu_hotplug.rst for the possible_cpus=NUM kernel start +parameter as well as more information on the various cpumasks. -- 2.14.3
[PATCH 5/5] acpi/processor: Make the acpi_duplicate_processor_id() static
The acpi_duplicate_processor_id() is only called in acpi_processor_get_info(), So move it in front of acpi_processor_get_info() and make it static. Signed-off-by: Dou Liyang --- drivers/acpi/acpi_processor.c | 62 +-- include/linux/acpi.h | 3 --- 2 files changed, 31 insertions(+), 34 deletions(-) diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c index db5bdb59639c..03ec7690710c 100644 --- a/drivers/acpi/acpi_processor.c +++ b/drivers/acpi/acpi_processor.c @@ -229,6 +229,37 @@ static inline int acpi_processor_hotadd_init(struct acpi_processor *pr) } #endif /* CONFIG_ACPI_HOTPLUG_CPU */ +/* The number of the unique processor IDs */ +static int nr_unique_ids __initdata; + +/* The number of the duplicate processor IDs */ +static int nr_duplicate_ids; + +/* Used to store the unique processor IDs */ +static int unique_processor_ids[] __initdata = { + [0 ... NR_CPUS - 1] = -1, +}; + +/* Used to store the duplicate processor IDs */ +static int duplicate_processor_ids[] = { + [0 ... NR_CPUS - 1] = -1, +}; + +static bool acpi_duplicate_processor_id(int proc_id) +{ + int i; + + /* +* Compare the proc_id with duplicate IDs, if the proc_id is already +* in the duplicate IDs, return true, otherwise, return false. +*/ + for (i = 0; i < nr_duplicate_ids; i++) { + if (duplicate_processor_ids[i] == proc_id) + return true; + } + return false; +} + static int acpi_processor_get_info(struct acpi_device *device) { union acpi_object object = { 0 }; @@ -579,22 +610,6 @@ static struct acpi_scan_handler processor_container_handler = { .attach = acpi_processor_container_attach, }; -/* The number of the unique processor IDs */ -static int nr_unique_ids __initdata; - -/* The number of the duplicate processor IDs */ -static int nr_duplicate_ids; - -/* Used to store the unique processor IDs */ -static int unique_processor_ids[] __initdata = { - [0 ... 
NR_CPUS - 1] = -1, -}; - -/* Used to store the duplicate processor IDs */ -static int duplicate_processor_ids[] = { - [0 ... NR_CPUS - 1] = -1, -}; - static void __init processor_validated_ids_update(int proc_id) { int i; @@ -682,21 +697,6 @@ static void __init acpi_processor_check_duplicates(void) NULL, NULL); } -bool acpi_duplicate_processor_id(int proc_id) -{ - int i; - - /* -* compare the proc_id with duplicate IDs, if the proc_id is already -* in the duplicate IDs, return true, otherwise, return false. -*/ - for (i = 0; i < nr_duplicate_ids; i++) { - if (duplicate_processor_ids[i] == proc_id) - return true; - } - return false; -} - void __init acpi_processor_init(void) { acpi_processor_check_duplicates(); diff --git a/include/linux/acpi.h b/include/linux/acpi.h index 968173ec2726..dd4591dc1eb3 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -285,9 +285,6 @@ static inline bool invalid_phys_cpuid(phys_cpuid_t phys_id) return phys_id == PHYS_CPUID_INVALID; } -/* Validate the processor object's proc_id */ -bool acpi_duplicate_processor_id(int proc_id); - #ifdef CONFIG_ACPI_HOTPLUG_CPU /* Arch dependent functions for cpu hotplug support */ int acpi_map_cpu(acpi_handle handle, phys_cpuid_t physid, u32 acpi_id, -- 2.14.3
[PATCH 0/5] x86/cpu_hotplug: one bug fix and four cleanups
Recently, we hoped to make the possible CPU count more accurate for the kernel. I am stuck on the issue of how to run acpi_early_init() _before_ setup_percpu() is invoked, so I am sending these insignificant patches first. This patchset does these things: - two documentation-related changes (the 1st and 2nd patches), - two cleanups (the 3rd and 5th patches) - a bug fix for CPU hotplug (the 4th patch) Dou Liyang (5): x86/smpboot: Add the missing description of possible_cpus x86/cpu_hotplug: Update the link of cpu_hotplug.rst x86/smpboot: Make the check code more clear in prefill_possible_map() acpi/processor: Fix the return value of acpi_processor_ids_walk() acpi/processor: Make the acpi_duplicate_processor_id() static Documentation/00-INDEX | 2 - Documentation/admin-guide/kernel-parameters.txt | 5 ++ Documentation/cputopology.txt | 10 ++-- Documentation/x86/x86_64/cpu-hotplug-spec | 2 +- arch/x86/kernel/smpboot.c | 31 +++- drivers/acpi/acpi_processor.c | 66 - include/linux/acpi.h| 3 -- 7 files changed, 62 insertions(+), 57 deletions(-) -- 2.14.3
[PATCH 4/5] acpi/processor: Fix the return value of acpi_processor_ids_walk()
The ACPI driver should make sure all the processor IDs in the ACPI namespace are unique for CPU hotplug. The driver performs a depth-first walk of the namespace tree and calls acpi_processor_ids_walk() for each processor object. But acpi_processor_ids_walk() returns true once a processor has been checked, which makes the walk stop after the first processor.

Replace the boolean return values with AE_OK, the standard acpi_status value, so the walk continues.

Fixes: 8c8cb30f49b8 ("acpi/processor: Implement DEVICE operator for processor enumeration")
Signed-off-by: Dou Liyang
---
 drivers/acpi/acpi_processor.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
index 449d86d39965..db5bdb59639c 100644
--- a/drivers/acpi/acpi_processor.c
+++ b/drivers/acpi/acpi_processor.c
@@ -663,11 +663,11 @@ static acpi_status __init acpi_processor_ids_walk(acpi_handle handle,
 	}
 
 	processor_validated_ids_update(uid);
-	return true;
+	return AE_OK;
 
 err:
 	acpi_handle_info(handle, "Invalid processor object\n");
-	return false;
+	return AE_OK;
 }

-- 
2.14.3
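The bug is easiest to see with a toy walker. In the sketch below (plain C with hypothetical names, not ACPICA's real API), the walker aborts as soon as a callback returns a non-zero status, so a callback that returns true (1) instead of an AE_OK-like 0 terminates the walk after the first processor:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status convention mirroring acpi_status: 0 (like AE_OK)
 * means "continue the walk"; anything else aborts it. */
#define WALK_OK 0

static int visited;			/* processors seen so far */
static const int demo_uids[3] = { 1, 2, 3 };

/* Correct callback: counts the processor and lets the walk continue. */
static int visit_processor(int uid)
{
	(void)uid;
	visited++;
	return WALK_OK;
}

/* Buggy callback in the style of the pre-fix acpi_processor_ids_walk():
 * returning true (non-zero) makes the walker stop after one node. */
static int visit_processor_buggy(int uid)
{
	(void)uid;
	visited++;
	return 1;
}

/* Minimal walker: stops as soon as a callback reports non-zero, the
 * way a namespace walk stops on an error status. */
static int walk_processors(const int *uids, size_t n, int (*cb)(int))
{
	size_t i;

	visited = 0;
	for (i = 0; i < n; i++) {
		if (cb(uids[i]) != WALK_OK)
			break;
	}
	return visited;
}
```

With the buggy callback only one of the three processors is visited, which is exactly the symptom the patch describes.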
[PATCH 1/5] x86/smpboot: Add the missing description of possible_cpus
The kernel uses the possible_cpus= command line option to set the possible_cpus bits in cpu_possible_map, but the option is not documented in kernel-parameters.txt.

Add its description to that document, and also replace the wrong option additional_cpus with possible_cpus in cpu-hotplug-spec.

Signed-off-by: Dou Liyang
---
 Documentation/admin-guide/kernel-parameters.txt | 5 +++++
 Documentation/x86/x86_64/cpu-hotplug-spec       | 2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 1d1d53f85ddd..34f8a5f04e63 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2832,6 +2832,11 @@
 			variables need be pre-allocated for later physical cpu
 			hot plugging.
 
+	possible_cpus=	[s390,x86_64] Use this to set hotpluggable cpus.
+			This option sets possible_cpus bits in cpu_possible_map.
+			Thus keeping the numbers of bits set constant even if
+			the machine gets rebooted.
+
 	nr_uarts=	[SERIAL] maximum number of UARTs to be registered.
 
 	numa_balancing=	[KNL,X86] Enable or disable automatic NUMA balancing.

diff --git a/Documentation/x86/x86_64/cpu-hotplug-spec b/Documentation/x86/x86_64/cpu-hotplug-spec
index 3c23e0587db3..d218bc0bdaaa 100644
--- a/Documentation/x86/x86_64/cpu-hotplug-spec
+++ b/Documentation/x86/x86_64/cpu-hotplug-spec
@@ -16,6 +16,6 @@
 it should have its LAPIC Enabled bit set to 0. Linux will use the number
 of disabled LAPICs to compute the maximum number of future CPUs. In the
 worst case the user can overwrite this choice using a command line
-option (additional_cpus=...), but it is recommended to supply the correct
+option (possible_cpus=...), but it is recommended to supply the correct
 number (or a reasonable approximation of it, with erring towards more
 not less) in the MADT to avoid manual configuration.
-- 
2.14.3
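How the command-line override interacts with the firmware-derived count can be sketched roughly as follows (a hypothetical helper, much simplified from what prefill_possible_map() actually does; it ignores the disabled-LAPIC and additional_cpus accounting entirely):

```c
#include <assert.h>

#define TOY_NR_CPUS 64		/* stand-in for the compile-time NR_CPUS */

/* setup_possible < 0 means possible_cpus= was not given on the command
 * line, so the count derived from the MADT/LAPIC enumeration is used;
 * either way the result is clamped to the compile-time maximum. */
static int compute_possible_cpus(int setup_possible, int enumerated)
{
	int possible = (setup_possible >= 0) ? setup_possible : enumerated;

	if (possible > TOY_NR_CPUS)
		possible = TOY_NR_CPUS;
	return possible;
}
```

The clamp is why a possible_cpus= value keeps the number of set bits in cpu_possible_map constant across reboots: the user-supplied count wins over whatever the firmware enumerates, up to the NR_CPUS ceiling.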
Re: [ACPI / processor] d619c81e24: WARNING:at_include/linux/cpumask.h:#cpumask_test_cpu
Hi lkp team,

Thank you for testing.

At 03/16/2018 07:17 PM, kernel test robot wrote:

FYI, we noticed the following commit (built with gcc-7):

commit: d619c81e246424e322f7a902bed6e60b90668d56 ("ACPI / processor: Get accurate possible CPU count")
url: https://github.com/0day-ci/linux/commits/Dou-Liyang/ACPI-processor-Get-accurate-possible-CPU-count/20180316-140349

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu host -smp 2 -m 1G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):

+------------------------------------------------------+------------+------------+
|                                                      | 3266b5bd97 | d619c81e24 |
+------------------------------------------------------+------------+------------+
| boot_successes                                       | 16         | 2          |
| boot_failures                                        | 0          | 18         |
| WARNING:at_include/linux/cpumask.h:#cpumask_test_cpu | 0          | 18         |
| WARNING:at_include/linux/cpumask.h:#cpumask_check    | 0          | 18         |
| RIP:cpumask_check                                    | 0          | 18         |
| RIP:cpumask_test_cpu                                 | 0          | 18         |
| general_protection_fault:#[##]                       | 0          | 13         |
| RIP:__lock_acquire                                   | 0          | 13         |
| Kernel_panic-not_syncing:Fatal_exception             | 0          | 13         |
| BUG:kernel_hang_in_boot_stage                        | 0          | 5          |
+------------------------------------------------------+------------+------------+

[0.830741] WARNING: CPU: 1 PID: 1 at include/linux/cpumask.h:122 cpumask_test_cpu+0x32/0x57
[0.830785] WARNING: CPU: 1 PID: 1 at include/linux/cpumask.h:122 cpumask_check+0x2d/0x48

Yes, this patch broke the CPU count handling; we will look for a new approach. Please drop the testing of this patch.

Thanks,
	dou
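For context, the WARNINGs the robot hit come from the range check on CPU numbers in include/linux/cpumask.h. A toy version of that check (illustrative names only; the kernel's cpumask_check() is an inline helper that WARN_ON_ONCE()s on an out-of-range CPU rather than setting a flag):

```c
#include <assert.h>

#define TOY_NR_CPUMASK_BITS 8		/* stand-in for nr_cpumask_bits */

static unsigned long toy_mask = 0x5UL;	/* CPUs 0 and 2 "possible" */
static int warned;			/* records that the check fired */

/* Returns 1 if cpu is a valid bit index; the kernel would WARN here
 * instead of quietly recording the violation. */
static int toy_cpumask_check(unsigned int cpu)
{
	if (cpu >= TOY_NR_CPUMASK_BITS) {
		warned = 1;
		return 0;
	}
	return 1;
}

/* Test a bit only after validating the index, like cpumask_test_cpu(). */
static int toy_cpumask_test_cpu(unsigned int cpu)
{
	if (!toy_cpumask_check(cpu))
		return 0;
	return (int)((toy_mask >> cpu) & 1UL);
}
```

A patch that miscomputes the possible CPU count ends up passing out-of-range CPU numbers into such accessors, which is the class of warning shown in the report above.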
Re: [PATCH] kernel/cpu: Move cpuhp_is_ap_state() into #ifdef CONFIG_SMP
At 03/16/2018 03:59 PM, Thomas Gleixner wrote:

On Fri, 16 Mar 2018, Dou Liyang wrote:

Commit 17a2f1ced028 ("cpu/hotplug: Merge cpuhp_bp_states and cpuhp_ap_states") removed the last use of cpuhp_is_ap_state() in the common case, which made the kernel produce a warning:

  'cpuhp_is_ap_state' defined but not used

So, move it into #ifdef CONFIG_SMP.

Thanks for the patch, but a fix is already queued in the smp/hotplug branch. It just did not make it into the for-next branch yet.

Oops, yes, let's ignore this patch.

Thanks,
	dou

Thanks,
	tglx
[PATCH] kernel/cpu: Move cpuhp_is_ap_state() into #ifdef CONFIG_SMP
Commit 17a2f1ced028 ("cpu/hotplug: Merge cpuhp_bp_states and cpuhp_ap_states") removed the last use of cpuhp_is_ap_state() in the common case, which made the kernel produce a warning:

  'cpuhp_is_ap_state' defined but not used

So, move it into #ifdef CONFIG_SMP.

Reported-by: Stephen Rothwell
Signed-off-by: Dou Liyang
---
 kernel/cpu.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index a1860d42aacf..db7ec7572348 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -126,15 +126,6 @@ struct cpuhp_step {
 static DEFINE_MUTEX(cpuhp_state_mutex);
 static struct cpuhp_step cpuhp_hp_states[];
 
-static bool cpuhp_is_ap_state(enum cpuhp_state state)
-{
-	/*
-	 * The extra check for CPUHP_TEARDOWN_CPU is only for documentation
-	 * purposes as that state is handled explicitly in cpu_down.
-	 */
-	return state > CPUHP_BRINGUP_CPU && state != CPUHP_TEARDOWN_CPU;
-}
-
 static struct cpuhp_step *cpuhp_get_step(enum cpuhp_state state)
 {
 	return cpuhp_hp_states + state;
@@ -247,6 +238,15 @@ static inline void complete_ap_thread(struct cpuhp_cpu_state *st, bool bringup)
 	complete(done);
 }
 
+static bool cpuhp_is_ap_state(enum cpuhp_state state)
+{
+	/*
+	 * The extra check for CPUHP_TEARDOWN_CPU is only for documentation
+	 * purposes as that state is handled explicitly in cpu_down.
+	 */
+	return state > CPUHP_BRINGUP_CPU && state != CPUHP_TEARDOWN_CPU;
+}
+
 /*
  * The former STARTING/DYING states, ran with IRQs disabled and must not fail.
  */
-- 
2.14.3
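The class of warning this patch silences can be reproduced in miniature: a static function whose only caller sits inside a conditional-compilation block must itself be defined inside that block, otherwise builds without the option see 'defined but not used'. In the sketch below, CONFIG_SMP is just a stand-in macro (not the real Kconfig machinery) and the helper names are illustrative:

```c
#include <assert.h>

#define CONFIG_SMP 1	/* remove this define to mimic a !SMP build */

#ifdef CONFIG_SMP
/* If this helper were defined outside the #ifdef, a build without
 * CONFIG_SMP would emit -Wunused-function for it, because its only
 * caller lives inside the #ifdef -- the situation the patch fixes by
 * moving the function next to its callers. */
static int is_ap_state_sketch(int state, int bringup, int teardown)
{
	/* mirrors the real check: past bringup, and not the teardown state */
	return state > bringup && state != teardown;
}

static int smp_only_caller(void)
{
	return is_ap_state_sketch(5, 3, 7);
}
#endif
```

Moving the definition inside the guard keeps both configurations warning-free without changing any behavior of the SMP build.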