Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
On Fri, Aug 27, 2021, Mathieu Desnoyers wrote:
> > So there are effectively three reasons we want a delay:
> >
> >   1. To allow sched_setaffinity() to coincide with ioctl(KVM_RUN) before KVM
> >      can enter the guest so that the guest doesn't need an arch-specific
> >      VM-Exit source.
> >
> >   2. To let ioctl(KVM_RUN) make its way back to the test before the next
> >      round of migration.
> >
> >   3. To ensure the read-side can make forward progress, e.g. if
> >      sched_getcpu() involves a syscall.
> >
> > After looking at KVM for arm64 and s390, #1 is a bit tenuous because x86 is
> > the only arch that currently uses xfer_to_guest_mode_work(), i.e. the test
> > could be tweaked to be overtly x86-specific.  But since a delay is needed
> > for #2 and #3, I'd prefer to rely on it for #1 as well in the hopes that
> > this test provides coverage for arm64 and/or s390 if they're ever converted
> > to use the common xfer_to_guest_mode_work().
>
> Now that we have this understanding of why we need the delay, it would be
> good to write this down in a comment within the test.

Ya, I'll get a new version out next week.

> Does it reproduce if we randomize the delay to have it picked randomly from
> 0us to 100us (with 1us step)?  It would remove a lot of the need for
> arch-specific magic delay values.

My less-than-scientific testing shows that it can reproduce at delays up to
~500us, but above ~10us the reproducibility starts to drop.  The bug still
reproduces reliably, it just takes more iterations, and obviously the test
runs a bit slower.

Any objection to using a 1-10us delay, e.g. a simple usleep((i % 10) + 1)?
Re: [PATCH] net: spider_net: switch from 'pci_' to 'dma_' API
Hi Christophe,

On 8/27/21 12:56 PM, Christophe JAILLET wrote:
> It has *not* been compile tested because I don't have the needed
> configuration or cross-compiler.

The powerpc ppc64_defconfig has CONFIG_SPIDER_NET set.  My tdd-builder
Docker image has the needed gcc-powerpc-linux-gnu cross compiler to build
ppc64_defconfig:

  https://hub.docker.com/r/glevand/tdd-builder

-Geoff
Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
- On Aug 27, 2021, at 7:23 PM, Sean Christopherson sea...@google.com wrote:

> On Fri, Aug 27, 2021, Mathieu Desnoyers wrote:

[...]

>> Does it reproduce if we randomize the delay to have it picked randomly from
>> 0us to 100us (with 1us step)?  It would remove a lot of the need for
>> arch-specific magic delay values.
>
> My less-than-scientific testing shows that it can reproduce at delays up to
> ~500us, but above ~10us the reproducibility starts to drop.  The bug still
> reproduces reliably, it just takes more iterations, and obviously the test
> runs a bit slower.
>
> Any objection to using a 1-10us delay, e.g. a simple usleep((i % 10) + 1)?

Works for me, thanks!

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
[PATCH] net: spider_net: switch from 'pci_' to 'dma_' API
In [1], Christoph Hellwig has proposed to remove the wrappers in
include/linux/pci-dma-compat.h.

Some reasons why this API should be removed have been given by Julia Lawall
in [2].

A coccinelle script has been used to perform the needed transformation.
Only the relevant parts are given below.

@@
@@
-    PCI_DMA_BIDIRECTIONAL
+    DMA_BIDIRECTIONAL

@@
@@
-    PCI_DMA_TODEVICE
+    DMA_TO_DEVICE

@@
@@
-    PCI_DMA_FROMDEVICE
+    DMA_FROM_DEVICE

@@
expression e1, e2, e3, e4;
@@
-    pci_map_single(e1, e2, e3, e4)
+    dma_map_single(&e1->dev, e2, e3, e4)

@@
expression e1, e2, e3, e4;
@@
-    pci_unmap_single(e1, e2, e3, e4)
+    dma_unmap_single(&e1->dev, e2, e3, e4)

@@
expression e1, e2;
@@
-    pci_dma_mapping_error(e1, e2)
+    dma_mapping_error(&e1->dev, e2)

[1]: https://lore.kernel.org/kernel-janitors/20200421081257.ga131...@infradead.org/
[2]: https://lore.kernel.org/kernel-janitors/alpine.DEB.2.22.394.2007120902170.2424@hadrien/

Signed-off-by: Christophe JAILLET
---
It has *not* been compile tested because I don't have the needed
configuration or cross-compiler.  However, the modification is completely
mechanical and done by coccinelle.
---
 drivers/net/ethernet/toshiba/spider_net.c | 27 +++++++++++++++------------
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/toshiba/spider_net.c b/drivers/net/ethernet/toshiba/spider_net.c
index 087f0af56c50..66d4e024d11e 100644
--- a/drivers/net/ethernet/toshiba/spider_net.c
+++ b/drivers/net/ethernet/toshiba/spider_net.c
@@ -354,9 +354,10 @@ spider_net_free_rx_chain_contents(struct spider_net_card *card)
 	descr = card->rx_chain.head;
 	do {
 		if (descr->skb) {
-			pci_unmap_single(card->pdev, descr->hwdescr->buf_addr,
+			dma_unmap_single(&card->pdev->dev,
+					 descr->hwdescr->buf_addr,
 					 SPIDER_NET_MAX_FRAME,
-					 PCI_DMA_BIDIRECTIONAL);
+					 DMA_BIDIRECTIONAL);
 			dev_kfree_skb(descr->skb);
 			descr->skb = NULL;
 		}
@@ -411,9 +412,9 @@ spider_net_prepare_rx_descr(struct spider_net_card *card,
 	if (offset)
 		skb_reserve(descr->skb, SPIDER_NET_RXBUF_ALIGN - offset);
 	/* iommu-map the skb */
-	buf = pci_map_single(card->pdev, descr->skb->data,
-			     SPIDER_NET_MAX_FRAME, PCI_DMA_FROMDEVICE);
-	if (pci_dma_mapping_error(card->pdev, buf)) {
+	buf = dma_map_single(&card->pdev->dev, descr->skb->data,
+			     SPIDER_NET_MAX_FRAME, DMA_FROM_DEVICE);
+	if (dma_mapping_error(&card->pdev->dev, buf)) {
 		dev_kfree_skb_any(descr->skb);
 		descr->skb = NULL;
 		if (netif_msg_rx_err(card) && net_ratelimit())
@@ -653,8 +654,9 @@ spider_net_prepare_tx_descr(struct spider_net_card *card,
 	dma_addr_t buf;
 	unsigned long flags;
 
-	buf = pci_map_single(card->pdev, skb->data, skb->len, PCI_DMA_TODEVICE);
-	if (pci_dma_mapping_error(card->pdev, buf)) {
+	buf = dma_map_single(&card->pdev->dev, skb->data, skb->len,
+			     DMA_TO_DEVICE);
+	if (dma_mapping_error(&card->pdev->dev, buf)) {
 		if (netif_msg_tx_err(card) && net_ratelimit())
 			dev_err(&card->netdev->dev, "could not iommu-map packet (%p, %i). "
 				"Dropping packet\n", skb->data, skb->len);
@@ -666,7 +668,8 @@ spider_net_prepare_tx_descr(struct spider_net_card *card,
 	descr = card->tx_chain.head;
 	if (descr->next == chain->tail->prev) {
 		spin_unlock_irqrestore(&chain->lock, flags);
-		pci_unmap_single(card->pdev, buf, skb->len, PCI_DMA_TODEVICE);
+		dma_unmap_single(&card->pdev->dev, buf, skb->len,
+				 DMA_TO_DEVICE);
 		return -ENOMEM;
 	}
 	hwdescr = descr->hwdescr;
@@ -822,8 +825,8 @@ spider_net_release_tx_chain(struct spider_net_card *card, int brutal)
 
 		/* unmap the skb */
 		if (skb) {
-			pci_unmap_single(card->pdev, buf_addr, skb->len,
-					 PCI_DMA_TODEVICE);
+			dma_unmap_single(&card->pdev->dev, buf_addr, skb->len,
+					 DMA_TO_DEVICE);
 			dev_consume_skb_any(skb);
 		}
 	}
@@ -1165,8 +1168,8 @@ spider_net_decode_one_descr(struct spider_net_card *card)
 	/* unmap descriptor */
 	hw_buf_addr = hwdescr->buf_addr;
 	hwdescr->buf_addr = 0x;
-	pci_unmap_single(card->pdev, hw_buf_addr,
-			 SPIDER_NET_MAX_FRAME, PCI_DMA_FROMDEVICE);
+	dma_unmap_single(&card->pdev->dev, hw_buf_addr, SPIDER_NET_MAX_FRAME,
+
Re: [PATCH v4 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance
On Fri, Aug 27, 2021 at 12:13:42PM +0200, Vincent Guittot wrote:
> On Tue, 10 Aug 2021 at 16:41, Ricardo Neri wrote:
> > @@ -9540,6 +9629,12 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> >                     nr_running == 1)
> >                     continue;
> >
> > +           /* Make sure we only pull tasks from a CPU of lower priority */
> > +           if ((env->sd->flags & SD_ASYM_PACKING) &&
> > +               sched_asym_prefer(i, env->dst_cpu) &&
> > +               nr_running == 1)
> > +                   continue;
>
> This really looks similar to the test above for SD_ASYM_CPUCAPACITY.
> More generally speaking, SD_ASYM_PACKING and SD_ASYM_CPUCAPACITY share
> a lot of common policy and I wonder if at some point we could not
> merge their behavior in LB.

I would like to confirm with you that you are not expecting this merge as
part of this series, right?

Thanks and BR,
Ricardo
Re: [PATCH v4 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance
On Fri, Aug 27, 2021 at 05:17:22PM +0200, Vincent Guittot wrote:
> On Fri, 27 Aug 2021 at 16:50, Peter Zijlstra wrote:
> >
> > On Fri, Aug 27, 2021 at 12:13:42PM +0200, Vincent Guittot wrote:
> > > > +/**
> > > > + * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull tasks
> > > > + * @dst_cpu:	Destination CPU of the load balancing
> > > > + * @sds:	Load-balancing data with statistics of the local group
> > > > + * @sgs:	Load-balancing statistics of the candidate busiest group
> > > > + * @sg:	The candidate busiest group
> > > > + *
> > > > + * Check the state of the SMT siblings of both @sds::local and @sg and decide
> > > > + * if @dst_cpu can pull tasks.  If @dst_cpu does not have SMT siblings, it can
> > > > + * pull tasks if two or more of the SMT siblings of @sg are busy.  If only one
> > > > + * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority.
> > > > + *
> > > > + * If both @dst_cpu and @sg have SMT siblings, even the number of idle CPUs
> > > > + * between @sds::local and @sg.  Thus, pull tasks from @sg if the difference
> > > > + * between the number of busy CPUs is 2 or more.  If the difference is of 1,
> > > > + * only pull if @dst_cpu has higher priority.  If @sg does not have SMT
> > > > + * siblings, only pull tasks if all of the SMT siblings of @dst_cpu are idle
> > > > + * and @sg has lower priority.
> > > > + */
> > > > +static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> > > > +				    struct sg_lb_stats *sgs,
> > > > +				    struct sched_group *sg)
> > > > +{
> > > > +#ifdef CONFIG_SCHED_SMT
> > > > +	bool local_is_smt, sg_is_smt;
> > > > +	int sg_busy_cpus;
> > > > +
> > > > +	local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
> > > > +	sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
> > > > +
> > > > +	sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
> > > > +
> > > > +	if (!local_is_smt) {
> > > > +		/*
> > > > +		 * If we are here, @dst_cpu is idle and does not have SMT
> > > > +		 * siblings.  Pull tasks if candidate group has two or more
> > > > +		 * busy CPUs.
> > > > +		 */
> > > > +		if (sg_is_smt && sg_busy_cpus >= 2)
> > > > +			return true;
> > > > +
> > > > +		/*
> > > > +		 * @dst_cpu does not have SMT siblings.  @sg may have SMT
> > > > +		 * siblings and only one is busy.  In such case, @dst_cpu
> > > > +		 * can help if it has higher priority and is idle.
> > > > +		 */
> > > > +		return !sds->local_stat.group_util &&
> > >
> > > sds->local_stat.group_util can't be used to decide if a CPU or group
> > > of CPUs is idle.  util_avg is usually not null when a CPU becomes idle,
> > > and you can have to wait more than 300ms before it becomes null.
> > > At the opposite, the utilization of a CPU can be null but a task with
> > > null utilization has just woken up on it.
> > > Utilization is used to reflect the average work of the CPU or group of
> > > CPUs but not the current state.
> >
> > If you want immediate idle, sgs->nr_running == 0 or sgs->idle_cpus ==
> > sgs->group_weight come to mind.
>
> yes, I have the same in mind

Thank you very much Vincent and Peter for the feedback!  I'll look at using
these stats to determine immediate idle.

> > > > +		       sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> > > > +	}
> > > > +
> > > > +	/* @dst_cpu has SMT siblings. */
> > > > +
> > > > +	if (sg_is_smt) {
> > > > +		int local_busy_cpus = sds->local->group_weight -
> > > > +				      sds->local_stat.idle_cpus;
> > > > +		int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
> > > > +
> > > > +		/* Local can always help to even the number of busy CPUs. */
> > >
> > > default behavior of the load balance already tries to even the number
> > > of idle CPUs.
> >
> > Right, but I suppose this is because we're trapped here and have to deal
> > with the SMT-SMT case too.  Ricardo, can you clarify?
>
> IIUC, this function is used in sg_lb_stats to set
> sgs->group_asym_packing which is then used to set the group state to
> group_asym_packing and force asym migration.
> But if we only want to even the number of busy CPUs between the groups,
> we should not need to set the group state to group_asym_packing.

Yes, what Vincent describes is the intent.  Then, I think it is probably
true that it is not necessary to even the number of idle CPUs here.

> > > > +		if (busy_cpus_delta >= 2)
> > > > +			return true;
> > > > +
> > > > +		if (busy_cpus_delta == 1)
> > > > +
Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
- On Aug 26, 2021, at 7:54 PM, Sean Christopherson sea...@google.com wrote:

> On Thu, Aug 26, 2021, Mathieu Desnoyers wrote:
>> - On Aug 25, 2021, at 8:51 PM, Sean Christopherson sea...@google.com wrote:
>> >> >> +		r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
>> >> >> +		TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
>> >> >> +			    errno, strerror(errno));
>> >> >> +		atomic_inc(&seq_cnt);
>> >> >> +
>> >> >> +		CPU_CLR(cpu, &allowed_mask);
>> >> >> +
>> >> >> +		/*
>> >> >> +		 * Let the read-side get back into KVM_RUN to improve the odds
>> >> >> +		 * of task migration coinciding with KVM's run loop.
>> >> >
>> >> > This comment should be about increasing the odds of letting the seqlock
>> >> > read-side complete.  Otherwise, the delay between the two back-to-back
>> >> > atomic_inc is so small that the seqlock read-side may never have time to
>> >> > complete reading the rseq cpu id and the sched_getcpu() call, and can
>> >> > retry forever.
>> >
>> > Hmm, but that's not why there's a delay.  I'm not arguing that a livelock
>> > isn't possible (though that syscall would have to be screaming fast),
>>
>> I don't think we have the same understanding of the livelock scenario.
>> AFAIU the livelock would be caused by a too-small delay between the two
>> consecutive atomic_inc() from one loop iteration to the next compared to
>> the time it takes to perform a read-side critical section of the seqlock.
>> Back-to-back atomic_inc can be performed very quickly, so I doubt that
>> the sched_getcpu implementation has good odds of being fast enough to
>> complete in that narrow window, leading to lots of read seqlock retries.
>
> Ooooh, yeah, brain fart on my side.  I was thinking of the two atomic_inc()
> in the same loop iteration and completely ignoring the next iteration.  Yes,
> I 100% agree that a delay to ensure forward progress is needed.  An assertion
> in main() that the reader completes at least some reasonable number of
> KVM_RUNs is also probably a good idea, e.g. to rule out a false pass due to
> the reader never making forward progress.

Agreed.

> FWIW, the do-while loop does make forward progress without a delay, but at
> ~50% throughput, give or take.

I did not expect absolutely no progress, but a significant slowdown of the
read-side.

>> > but the primary motivation is very much to allow the read-side enough time
>> > to get back into KVM proper.
>>
>> I'm puzzled by your statement.  AFAIU, let's say we don't have the delay,
>> then we have:
>>
>> migration thread                       KVM_RUN/read-side thread
>> -----------------------------------------------------------------
>>                                        - ioctl(KVM_RUN)
>> - atomic_inc_seq_cst(&seqcnt)
>> - sched_setaffinity
>> - atomic_inc_seq_cst(&seqcnt)
>>                                        - a = atomic_load(&seqcnt) & ~1
>>                                        - smp_rmb()
>>                                        - b = LOAD_ONCE(__rseq_abi->cpu_id);
>>                                        - sched_getcpu()
>>                                        - smp_rmb()
>>                                        - re-load seqcnt/compare (succeeds)
>>                                          - Can only succeed if entire
>>                                            read-side happens while the
>>                                            seqcnt is in an even numbered
>>                                            state.
>>                                        - if (a != b) abort()
>> /* no delay. Even counter state is very short. */
>> - atomic_inc_seq_cst(&seqcnt)
>> /* Let's suppose the lack of delay causes the
>>    setaffinity to complete too early compared
>>    with KVM_RUN ioctl */
>> - sched_setaffinity
>> - atomic_inc_seq_cst(&seqcnt)
>> /* no delay. Even counter state is very short. */
>> - atomic_inc_seq_cst(&seqcnt)
>> /* Then a setaffinity from a following
>>    migration thread loop will run
>>    concurrently with KVM_RUN */
>>                                        - ioctl(KVM_RUN)
>> - sched_setaffinity
>> - atomic_inc_seq_cst(&seqcnt)
>>
>> As pointed out here, if the first setaffinity runs too early compared with
>> KVM_RUN, a following setaffinity will run concurrently with it.  However,
>> the fact that the even counter state is very short may very well hurt
>> progress of the read seqlock.
>
> *sigh*
>
> Several hours later, I think I finally have my head wrapped around everything.
>
> Due to the way the test is written and because of how KVM's run loop works,
> TIF_NOTIFY_RESUME or TIF_NEED_RESCHED effectively has to be set before KVM
> actually enters the guest, otherwise KVM will exit to userspa
Re: [FSL P50x0] lscpu reports wrong values since the RC1 of kernel 5.13
> On 26. Aug 2021, at 05:43, Srikar Dronamraju wrote:
>
> * Christian Zigotzky [2021-08-16 14:29:21]:
>
> Hi Christian,
>
>> I tested the RC6 of kernel 5.14 today and unfortunately the issue still
>> exists.  We have figured out that only P5040 SoCs are affected. [1]
>> P5020 SoCs display the correct values.
>> Please check the CPU changes in the PowerPC updates 5.13-1 and 5.13-2.
>>
>
> Thanks for reporting the issue.
> Would it be possible to try
> https://lore.kernel.org/linuxppc-dev/20210821092419.167454-3-sri...@linux.vnet.ibm.com/t/#u

Hi Srikar,

This patch works!  Thanks a lot!

Cheers,
Christian

> If the above patch is not helping, then can you please collect the output of
>
>   cat /sys/devices/system/cpu/cpu*/topology/core_siblings
>
> Were all the CPUs online at the time of boot?
> Did we do any CPU online/offline operations post boot?
>
> If we did CPU online/offline, can you capture the output just after the
> boot along with the lscpu output?
>
> Since this is only being seen on a few SoCs, can you summarize the
> difference between P5040 and P5020?
>
>> [1] https://forum.hyperion-entertainment.com/viewtopic.php?p=53775#p53775
>>
>>> On 09 August 2021 at 02:37 pm, Christian Zigotzky wrote:
>>> Hi All,
>>>
>>> lscpu reports wrong values [1] since the RC1 of kernel 5.13 on my FSL
>>> P5040 Cyrus+ board (A-EON AmigaOne X5000). [2]
>>> The differences are:
>>>
>>> Since the RC1 of kernel 5.13 (wrong values):
>>>
>>> Core(s) per socket: 1
>>> Socket(s):          3
>
> I know that the socket count was off by 1, but I can't explain how it's off
> by 2 here.
>
>>> Before (correct values):
>>>
>>> Core(s) per socket: 4
>>> Socket(s):          1

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH v6 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping
Hello Fred,

On Fri, 2021-08-27 at 19:06 +0200, Frederic Barrat wrote:
>
> I think it looks ok now as it was mostly me who was misunderstanding one
> part of the previous iteration.
> Reviewed-by: Frederic Barrat
>

Thank you for reviewing this series!

> Sorry for the late review, I was enjoying some time off.  And thanks for
> that series, I believe it should help with those bugs complaining about
> lack of DMA space.  It was also very educational for me, thanks to you
> and Alexey for your detailed answers.

Working on this series caused me to learn a lot about how DMA and IOMMU work
on Power, and it was a great experience.  Thanks to Alexey, who helped and
guided me through this!

Best regards,
Leonardo Bras
Re: [PATCH v6 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping
On 17/08/2021 08:39, Leonardo Bras wrote:
> So far it's assumed possible to map the guest RAM 1:1 to the bus, which
> works with a small number of devices.  SRIOV changes it as the user can
> configure hundreds of VFs, and since phyp preallocates TCEs and does not
> allow IOMMU pages bigger than 64K, it has to limit the number of TCEs per
> PE to limit waste of physical pages.
>
> As of today, if the assumed direct mapping is not possible, DDW creation
> is skipped and the default DMA window "ibm,dma-window" is used instead.
>
> By using DDW, indirect mapping can get more TCEs than available for the
> default DMA window, and also get access to using much larger pagesizes
> (16MB as implemented in qemu vs 4k from the default DMA window), causing
> a significant increase in the maximum amount of memory that can be IOMMU
> mapped at the same time.
>
> Indirect mapping will only be used if direct mapping is not a
> possibility.
>
> For indirect mapping, it's necessary to re-create the iommu_table with
> the new DMA window parameters, so iommu_alloc() can use it.
>
> Removing the default DMA window for using DDW with indirect mapping is
> only allowed if there is no current IOMMU memory allocated in the
> iommu_table.  enable_ddw() is aborted otherwise.
>
> Even though there won't be both direct and indirect mappings at the same
> time, we can't reuse the DIRECT64_PROPNAME property name, or else an
> older kexec()ed kernel can assume direct mapping, and skip iommu_alloc(),
> causing undesirable behavior.
> So a new property name DMA64_PROPNAME "linux,dma64-ddr-window-info" was
> created to represent a DDW that does not allow direct mapping.
>
> Signed-off-by: Leonardo Bras
> ---

I think it looks ok now as it was mostly me who was misunderstanding one
part of the previous iteration.
Reviewed-by: Frederic Barrat

Sorry for the late review, I was enjoying some time off.  And thanks for
that series, I believe it should help with those bugs complaining about
lack of DMA space.  It was also very educational for me, thanks to you and
Alexey for your detailed answers.

  Fred

>  arch/powerpc/platforms/pseries/iommu.c | 89 +++++++++++++++++++++-----
>  1 file changed, 74 insertions(+), 15 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index e11c00b2dc1e..0eccc29f5573 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -375,6 +375,7 @@ static DEFINE_SPINLOCK(direct_window_list_lock);
>  /* protects initializing window twice for same device */
>  static DEFINE_MUTEX(direct_window_init_mutex);
>  #define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
> +#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
>
>  static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
>  					unsigned long num_pfn, const void *arg)
> @@ -940,6 +941,7 @@ static int find_existing_ddw_windows(void)
>  		return 0;
>
>  	find_existing_ddw_windows_named(DIRECT64_PROPNAME);
> +	find_existing_ddw_windows_named(DMA64_PROPNAME);
>
>  	return 0;
>  }
> @@ -1226,14 +1228,17 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>  	struct ddw_create_response create;
>  	int page_shift;
>  	u64 win_addr;
> +	const char *win_name;
>  	struct device_node *dn;
>  	u32 ddw_avail[DDW_APPLICABLE_SIZE];
>  	struct direct_window *window;
>  	struct property *win64;
>  	bool ddw_enabled = false;
>  	struct failed_ddw_pdn *fpdn;
> -	bool default_win_removed = false;
> +	bool default_win_removed = false, direct_mapping = false;
>  	bool pmem_present;
> +	struct pci_dn *pci = PCI_DN(pdn);
> +	struct iommu_table *tbl = pci->table_group->tables[0];
>
>  	dn = of_find_node_by_type(NULL, "ibm,pmemory");
>  	pmem_present = dn != NULL;
> @@ -1242,6 +1247,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>  	mutex_lock(&direct_window_init_mutex);
>
>  	if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len)) {
> +		direct_mapping = (len >= max_ram_len);
>  		ddw_enabled = true;
>  		goto out_unlock;
>  	}
> @@ -1322,8 +1328,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>  			  query.page_size);
>  		goto out_failed;
>  	}
> -	/* verify the window * number of ptes will map the partition */
> -	/* check largest block * page size > max memory hotplug addr */
> +
>  	/*
>  	 * The "ibm,pmemory" can appear anywhere in the address space.
>  	 * Assuming it is still backed by page structs, try MAX_PHYSMEM_BITS
> @@ -1339,13 +1345,25 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>  		dev_info(&dev->dev, "Skipping ibm,pmemory");
>  	}
>
> +	/* check if the available block * number of ptes will map everything */
>  	if (query.largest_available_block < (1ULL << (len - page_shift))) {
Re: [PATCH v6 08/11] powerpc/pseries/iommu: Update remove_dma_window() to accept property name
On 17/08/2021 08:39, Leonardo Bras wrote:
> Update remove_dma_window() so it can be used to remove DDW with a given
> property name.
>
> This enables the creation of new property names for DDW, so we can have
> different usage for it, like indirect mapping.
>
> Also, add return values to it so we can check if the property was found
> while removing the active DDW.  This allows skipping the remaining
> property names while reducing the impact of multiple property names.
>
> Signed-off-by: Leonardo Bras
> Reviewed-by: Alexey Kardashevskiy
> ---

Reviewed-by: Frederic Barrat

>  arch/powerpc/platforms/pseries/iommu.c | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index a47f59a8f107..901f290999d0 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -844,31 +844,33 @@ static void remove_dma_window(struct device_node *np, u32 *ddw_avail,
>  	__remove_dma_window(np, ddw_avail, liobn);
>  }
>
> -static void remove_ddw(struct device_node *np, bool remove_prop)
> +static int remove_ddw(struct device_node *np, bool remove_prop, const char *win_name)
>  {
>  	struct property *win;
>  	u32 ddw_avail[DDW_APPLICABLE_SIZE];
>  	int ret = 0;
>
> +	win = of_find_property(np, win_name, NULL);
> +	if (!win)
> +		return -EINVAL;
> +
>  	ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
>  					 &ddw_avail[0], DDW_APPLICABLE_SIZE);
>  	if (ret)
> -		return;
> +		return 0;
>
> -	win = of_find_property(np, DIRECT64_PROPNAME, NULL);
> -	if (!win)
> -		return;
>
>  	if (win->length >= sizeof(struct dynamic_dma_window_prop))
>  		remove_dma_window(np, ddw_avail, win);
>
>  	if (!remove_prop)
> -		return;
> +		return 0;
>
>  	ret = of_remove_property(np, win);
>  	if (ret)
>  		pr_warn("%pOF: failed to remove direct window property: %d\n",
>  			np, ret);
> +	return 0;
>  }
>
>  static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *window_shift)
> @@ -921,7 +923,7 @@ static int find_existing_ddw_windows(void)
>  	for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
>  		direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
>  		if (!direct64 || len < sizeof(*direct64)) {
> -			remove_ddw(pdn, true);
> +			remove_ddw(pdn, true, DIRECT64_PROPNAME);
>  			continue;
>  		}
>
> @@ -1565,7 +1567,7 @@ static int iommu_reconfig_notifier(struct notifier_block *nb, unsigned long acti
>  		 * we have to remove the property when releasing
>  		 * the device node.
>  		 */
> -		remove_ddw(np, false);
> +		remove_ddw(np, false, DIRECT64_PROPNAME);
>  		if (pci && pci->table_group)
>  			iommu_pseries_free_group(pci->table_group,
>  						 np->full_name);
Re: [PATCH v6 06/11] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()
On 17/08/2021 08:39, Leonardo Bras wrote:
> Code used to create a ddw property that was previously scattered in
> enable_ddw() is now gathered in ddw_property_create(), which deals with
> allocation and filling the property, leaving it ready for
> of_property_add(), which now occurs in sequence.
>
> This created an opportunity to reorganize the second part of enable_ddw():
>
> Without this patch enable_ddw() does, in order:
> kzalloc() property & members, create_ddw(), fill ddwprop inside property,
> ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
> of_add_property(), and list_add().
>
> With this patch enable_ddw() does, in order:
> create_ddw(), ddw_property_create(), of_add_property(),
> ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
> and list_add().
>
> This change requires of_remove_property() in case anything fails after
> of_add_property(), but we get to do tce_setrange_multi_pSeriesLP_walk in
> all memory, which looks like the most expensive operation, only if
> everything else succeeds.
>
> Also, the error path got remove_ddw() replaced by a new helper
> __remove_dma_window(), which only removes the new DDW with an rtas call.
> For this, a new helper clean_dma_window() was needed to clean anything
> that could be left if walk_system_ram_range() fails.
>
> Signed-off-by: Leonardo Bras
> Reviewed-by: Alexey Kardashevskiy
> ---

Thanks for fixing the error paths.

Reviewed-by: Frederic Barrat

>  arch/powerpc/platforms/pseries/iommu.c | 129 +++++++++++++++++--------
>  1 file changed, 84 insertions(+), 45 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index b34b473bbdc1..00392582fe10 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -795,17 +795,10 @@ static int __init disable_ddw_setup(char *str)
>
>  early_param("disable_ddw", disable_ddw_setup);
>
> -static void remove_dma_window(struct device_node *np, u32 *ddw_avail,
> -			      struct property *win)
> +static void clean_dma_window(struct device_node *np, struct dynamic_dma_window_prop *dwp)
>  {
> -	struct dynamic_dma_window_prop *dwp;
> -	u64 liobn;
>  	int ret;
>
> -	dwp = win->value;
> -	liobn = (u64)be32_to_cpu(dwp->liobn);
> -
> -	/* clear the whole window, note the arg is in kernel pages */
>  	ret = tce_clearrange_multi_pSeriesLP(0,
>  			1ULL << (be32_to_cpu(dwp->window_shift) - PAGE_SHIFT), dwp);
>  	if (ret)
> @@ -814,18 +807,39 @@ static void remove_dma_window(struct device_node *np, u32 *ddw_avail,
>  	else
>  		pr_debug("%pOF successfully cleared tces in window.\n",
>  			 np);
> +}
> +
> +/*
> + * Call only if DMA window is clean.
> + */
> +static void __remove_dma_window(struct device_node *np, u32 *ddw_avail, u64 liobn)
> +{
> +	int ret;
>
>  	ret = rtas_call(ddw_avail[DDW_REMOVE_PE_DMA_WIN], 1, 1, NULL, liobn);
>  	if (ret)
> -		pr_warn("%pOF: failed to remove direct window: rtas returned "
> +		pr_warn("%pOF: failed to remove DMA window: rtas returned "
>  			"%d to ibm,remove-pe-dma-window(%x) %llx\n",
>  			np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
>  	else
> -		pr_debug("%pOF: successfully removed direct window: rtas returned "
> +		pr_debug("%pOF: successfully removed DMA window: rtas returned "
>  			"%d to ibm,remove-pe-dma-window(%x) %llx\n",
>  			np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
>  }
>
> +static void remove_dma_window(struct device_node *np, u32 *ddw_avail,
> +			      struct property *win)
> +{
> +	struct dynamic_dma_window_prop *dwp;
> +	u64 liobn;
> +
> +	dwp = win->value;
> +	liobn = (u64)be32_to_cpu(dwp->liobn);
> +
> +	clean_dma_window(np, dwp);
> +	__remove_dma_window(np, ddw_avail, liobn);
> +}
> +
>  static void remove_ddw(struct device_node *np, bool remove_prop)
>  {
>  	struct property *win;
> @@ -1153,6 +1167,35 @@ static int iommu_get_page_shift(u32 query_page_size)
>  	return 0;
>  }
>
> +static struct property *ddw_property_create(const char *propname, u32 liobn, u64 dma_addr,
> +					    u32 page_shift, u32 window_shift)
> +{
> +	struct dynamic_dma_window_prop *ddwprop;
> +	struct property *win64;
> +
> +	win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
> +	if (!win64)
> +		return NULL;
> +
> +	win64->name = kstrdup(propname, GFP_KERNEL);
> +	ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
> +	win64->value = ddwprop;
> +	win64->length = sizeof(*ddwprop);
> +	if (!win64->name || !win64->value) {
> +		kfree(win64->name);
> +		kfree(win64->value);
> +		kfree(win64);
> +		return NULL;
> +	}
> +
> +	ddwprop->liobn = cpu_to_be32(liobn);
> +	ddwprop->dma_base = cpu_to_be64(dma_addr);
> +	ddwprop->tce_shift = cpu_to_b
Re: [PATCH v6 02/11] powerpc/kernel/iommu: Add new iommu_table_in_use() helper
On 17/08/2021 08:39, Leonardo Bras wrote:
Having a function to check if the iommu table has any allocation helps
decide if a tbl can be reset for using a new DMA window.

It should be enough to replace all instances of !bitmap_empty(tbl...).

iommu_table_in_use() skips reserved memory, so we don't need to worry
about releasing it before testing. This causes iommu_table_release_pages()
to become unnecessary, given it is only used to remove reserved memory
for testing.

Also, only allow storing reserved memory values in tbl if they are valid
in the table, so there is no need to check it in the new helper.

Signed-off-by: Leonardo Bras
Reviewed-by: Alexey Kardashevskiy
---

Looks ok to me now, thanks!
Reviewed-by: Frederic Barrat

 arch/powerpc/include/asm/iommu.h |  1 +
 arch/powerpc/kernel/iommu.c      | 61
 2 files changed, 32 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index deef7c94d7b6..bf3b84128525 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -154,6 +154,7 @@ extern int iommu_tce_table_put(struct iommu_table *tbl);
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
 		unsigned long res_start, unsigned long res_end);
+bool iommu_table_in_use(struct iommu_table *tbl);
 
 #define IOMMU_TABLE_GROUP_MAX_TABLES	2
 
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 2af89a5e379f..ed98ad63633e 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -690,32 +690,24 @@ static void iommu_table_reserve_pages(struct iommu_table *tbl,
 	if (tbl->it_offset == 0)
 		set_bit(0, tbl->it_map);
 
-	tbl->it_reserved_start = res_start;
-	tbl->it_reserved_end = res_end;
-
-	/* Check if res_start..res_end isn't empty and overlaps the table */
-	if (res_start && res_end &&
-			(tbl->it_offset + tbl->it_size < res_start ||
-			 res_end < tbl->it_offset))
-		return;
+	if (res_start < tbl->it_offset)
+		res_start = tbl->it_offset;
 
-	for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)
-		set_bit(i - tbl->it_offset, tbl->it_map);
-}
+	if (res_end > (tbl->it_offset + tbl->it_size))
+		res_end = tbl->it_offset + tbl->it_size;
 
-static void iommu_table_release_pages(struct iommu_table *tbl)
-{
-	int i;
+	/* Check if res_start..res_end is a valid range in the table */
+	if (res_start >= res_end) {
+		tbl->it_reserved_start = tbl->it_offset;
+		tbl->it_reserved_end = tbl->it_offset;
+		return;
+	}
 
-	/*
-	 * In case we have reserved the first bit, we should not emit
-	 * the warning below.
-	 */
-	if (tbl->it_offset == 0)
-		clear_bit(0, tbl->it_map);
+	tbl->it_reserved_start = res_start;
+	tbl->it_reserved_end = res_end;
 
 	for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)
-		clear_bit(i - tbl->it_offset, tbl->it_map);
+		set_bit(i - tbl->it_offset, tbl->it_map);
 }
 
 /*
@@ -779,6 +771,22 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
 	return tbl;
 }
 
+bool iommu_table_in_use(struct iommu_table *tbl)
+{
+	unsigned long start = 0, end;
+
+	/* ignore reserved bit0 */
+	if (tbl->it_offset == 0)
+		start = 1;
+	end = tbl->it_reserved_start - tbl->it_offset;
+	if (find_next_bit(tbl->it_map, end, start) != end)
+		return true;
+
+	start = tbl->it_reserved_end - tbl->it_offset;
+	end = tbl->it_size;
+	return find_next_bit(tbl->it_map, end, start) != end;
+}
+
 static void iommu_table_free(struct kref *kref)
 {
 	struct iommu_table *tbl;
@@ -795,10 +803,8 @@ static void iommu_table_free(struct kref *kref)
 	iommu_debugfs_del(tbl);
 
-	iommu_table_release_pages(tbl);
-
 	/* verify that table contains no entries */
-	if (!bitmap_empty(tbl->it_map, tbl->it_size))
+	if (iommu_table_in_use(tbl))
 		pr_warn("%s: Unexpected TCEs\n", __func__);
 
 	/* free bitmap */
@@ -1099,14 +1105,9 @@ int iommu_take_ownership(struct iommu_table *tbl)
 	for (i = 0; i < tbl->nr_pools; i++)
 		spin_lock_nest_lock(&tbl->pools[i].lock, &tbl->large_pool.lock);
 
-	iommu_table_release_pages(tbl);
-
-	if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
+	if (iommu_table_in_use(tbl)) {
 		pr_err("iommu_tce: it_map is not empty");
 		ret = -EBUSY;
-		/* Undo iommu_table_release_pages, i.e. restore bit#0, etc */
-		iommu_table_reserve_pages(tbl, tbl->it_reserved_start,
-				tbl->it_reserved_end);
 	} else {
 		memse
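The range arithmetic in iommu_table_in_use() above (skip bit 0, scan up to
the reserved window, then from its end to it_size) is easy to misread, so
here is a minimal userspace model of the same two-scan logic. Everything in
it is a stand-in invented for illustration: the byte-array bitmap, the naive
find_next_bit() reimplementation, and struct toy_tbl are not kernel code,
they only mirror the kernel field names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the kernel's find_next_bit(): index of the first set
 * bit in [start, end), or end if none is set. */
static size_t find_next_bit(const unsigned char *map, size_t end, size_t start)
{
	for (size_t i = start; i < end; i++)
		if (map[i / 8] & (1u << (i % 8)))
			return i;
	return end;
}

/* Userspace model of an iommu table: it_map covers it_size entries,
 * it_reserved_{start,end} are absolute entry numbers. */
struct toy_tbl {
	unsigned long it_offset;
	unsigned long it_size;
	unsigned long it_reserved_start;
	unsigned long it_reserved_end;
	unsigned char it_map[16];	/* room for 128 entries */
};

/* The table is "in use" iff any bit outside the reserved window (and
 * outside the implicitly reserved bit 0 when it_offset == 0) is set. */
static bool table_in_use(const struct toy_tbl *tbl)
{
	size_t start = 0, end;

	if (tbl->it_offset == 0)	/* ignore reserved bit 0 */
		start = 1;
	end = tbl->it_reserved_start - tbl->it_offset;
	if (find_next_bit(tbl->it_map, end, start) != end)
		return true;

	start = tbl->it_reserved_end - tbl->it_offset;
	end = tbl->it_size;
	return find_next_bit(tbl->it_map, end, start) != end;
}
```

Bits inside [it_reserved_start, it_reserved_end) never make the table look
busy, which is exactly why the patch can drop iommu_table_release_pages().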
[RFC PATCH 6/6] powerpc/microwatt: Stop building the hash MMU code
Microwatt is radix-only, so stop selecting the hash MMU code.

This saves 20kB compressed dtbImage and 56kB vmlinux size.

Signed-off-by: Nicholas Piggin
---
 arch/powerpc/configs/microwatt_defconfig | 1 -
 arch/powerpc/platforms/Kconfig.cputype   | 1 +
 arch/powerpc/platforms/microwatt/Kconfig | 1 -
 3 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/configs/microwatt_defconfig b/arch/powerpc/configs/microwatt_defconfig
index bf5f2e5905eb..6fbad42f9e3d 100644
--- a/arch/powerpc/configs/microwatt_defconfig
+++ b/arch/powerpc/configs/microwatt_defconfig
@@ -26,7 +26,6 @@ CONFIG_PPC_MICROWATT=y
 # CONFIG_PPC_OF_BOOT_TRAMPOLINE is not set
 CONFIG_CPU_FREQ=y
 CONFIG_HZ_100=y
-# CONFIG_PPC_MEM_KEYS is not set
 # CONFIG_SECCOMP is not set
 # CONFIG_MQ_IOSCHED_KYBER is not set
 # CONFIG_COREDUMP is not set
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index 9b90fc501ed4..b9acd6c68c81 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -371,6 +371,7 @@ config SPE
 config PPC_64S_HASH_MMU
 	bool "Hash MMU Support"
 	depends on PPC_BOOK3S_64
+	depends on !PPC_MICROWATT
 	default y
 	select PPC_MM_SLICES
 	help
diff --git a/arch/powerpc/platforms/microwatt/Kconfig b/arch/powerpc/platforms/microwatt/Kconfig
index e0ff2cfc1ca0..4f0ce292a6ce 100644
--- a/arch/powerpc/platforms/microwatt/Kconfig
+++ b/arch/powerpc/platforms/microwatt/Kconfig
@@ -6,7 +6,6 @@ config PPC_MICROWATT
 	select PPC_XICS
 	select PPC_ICS_NATIVE
 	select PPC_ICP_NATIVE
-	select PPC_HASH_MMU_NATIVE if PPC_64S_HASH_MMU
 	select PPC_UDBG_16550
 	select ARCH_RANDOM
 	help
-- 
2.23.0
[RFC PATCH 5/6] powerpc/microwatt: select POWER9_CPU
Microwatt implements a subset of ISA v3.0 which is equivalent to
the POWER9_CPU selection.

Signed-off-by: Nicholas Piggin
---
 arch/powerpc/configs/microwatt_defconfig | 1 +
 arch/powerpc/platforms/microwatt/Kconfig | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/powerpc/configs/microwatt_defconfig b/arch/powerpc/configs/microwatt_defconfig
index a08b739123da..bf5f2e5905eb 100644
--- a/arch/powerpc/configs/microwatt_defconfig
+++ b/arch/powerpc/configs/microwatt_defconfig
@@ -14,6 +14,7 @@ CONFIG_EMBEDDED=y
 # CONFIG_COMPAT_BRK is not set
 # CONFIG_SLAB_MERGE_DEFAULT is not set
 CONFIG_PPC64=y
+CONFIG_POWER9_CPU=y
 # CONFIG_PPC_KUEP is not set
 # CONFIG_PPC_KUAP is not set
 CONFIG_CPU_LITTLE_ENDIAN=y
diff --git a/arch/powerpc/platforms/microwatt/Kconfig b/arch/powerpc/platforms/microwatt/Kconfig
index 823192e9d38a..e0ff2cfc1ca0 100644
--- a/arch/powerpc/platforms/microwatt/Kconfig
+++ b/arch/powerpc/platforms/microwatt/Kconfig
@@ -2,6 +2,7 @@
 config PPC_MICROWATT
 	depends on PPC_BOOK3S_64 && !SMP
 	bool "Microwatt SoC platform"
+	select POWER9_CPU
 	select PPC_XICS
 	select PPC_ICS_NATIVE
 	select PPC_ICP_NATIVE
-- 
2.23.0
[RFC PATCH 4/6] powerpc/64s: Make hash MMU code build configurable
Introduce a new option CONFIG_PPC_64S_HASH_MMU which allows the 64s hash MMU code to be compiled out if radix is selected and the minimum supported CPU type is POWER9 or higher, and KVM is not selected. This saves 128kB kernel image size (90kB text) on powernv_defconfig minus KVM, 350kB on pseries_defconfig minus KVM, 40kB on a tiny config. Signed-off-by: Nicholas Piggin --- arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/book3s/64/mmu.h | 22 +- .../include/asm/book3s/64/tlbflush-hash.h | 7 ++ arch/powerpc/include/asm/book3s/pgtable.h | 4 ++ arch/powerpc/include/asm/mmu.h| 38 +-- arch/powerpc/include/asm/mmu_context.h| 2 + arch/powerpc/include/asm/paca.h | 8 +++ arch/powerpc/kernel/asm-offsets.c | 2 + arch/powerpc/kernel/dt_cpu_ftrs.c | 10 ++- arch/powerpc/kernel/entry_64.S| 4 +- arch/powerpc/kernel/exceptions-64s.S | 16 + arch/powerpc/kernel/mce.c | 2 +- arch/powerpc/kernel/mce_power.c | 10 ++- arch/powerpc/kernel/paca.c| 18 ++--- arch/powerpc/kernel/process.c | 13 ++-- arch/powerpc/kernel/prom.c| 2 + arch/powerpc/kernel/setup_64.c| 4 ++ arch/powerpc/kexec/core_64.c | 4 +- arch/powerpc/kexec/ranges.c | 4 ++ arch/powerpc/kvm/Kconfig | 1 + arch/powerpc/mm/book3s64/Makefile | 17 +++-- arch/powerpc/mm/book3s64/hash_pgtable.c | 1 - arch/powerpc/mm/book3s64/hash_utils.c | 10 --- .../{hash_hugetlbpage.c => hugetlbpage.c} | 6 ++ arch/powerpc/mm/book3s64/mmu_context.c| 16 + arch/powerpc/mm/book3s64/pgtable.c| 22 +- arch/powerpc/mm/book3s64/radix_pgtable.c | 4 ++ arch/powerpc/mm/book3s64/slb.c| 16 - arch/powerpc/mm/copro_fault.c | 2 + arch/powerpc/mm/fault.c | 17 + arch/powerpc/mm/pgtable.c | 10 ++- arch/powerpc/platforms/Kconfig.cputype| 20 +- arch/powerpc/platforms/cell/Kconfig | 1 + arch/powerpc/platforms/maple/Kconfig | 1 + arch/powerpc/platforms/microwatt/Kconfig | 2 +- arch/powerpc/platforms/pasemi/Kconfig | 1 + arch/powerpc/platforms/powermac/Kconfig | 1 + arch/powerpc/platforms/powernv/Kconfig| 2 +- arch/powerpc/platforms/powernv/idle.c | 2 + 
arch/powerpc/platforms/powernv/setup.c| 2 + arch/powerpc/platforms/pseries/lpar.c | 68 ++- arch/powerpc/platforms/pseries/lparcfg.c | 2 +- arch/powerpc/platforms/pseries/mobility.c | 6 ++ arch/powerpc/platforms/pseries/ras.c | 4 ++ arch/powerpc/platforms/pseries/reconfig.c | 2 + arch/powerpc/platforms/pseries/setup.c| 6 +- arch/powerpc/xmon/xmon.c | 8 ++- 47 files changed, 310 insertions(+), 111 deletions(-) rename arch/powerpc/mm/book3s64/{hash_hugetlbpage.c => hugetlbpage.c} (95%) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index d01e3401581d..db4e0efd229b 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -940,6 +940,7 @@ config PPC_MEM_KEYS prompt "PowerPC Memory Protection Keys" def_bool y depends on PPC_BOOK3S_64 + depends on PPC_64S_HASH_MMU select ARCH_USES_HIGH_VMA_FLAGS select ARCH_HAS_PKEYS help diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h index c02f42d1031e..857dc88b0043 100644 --- a/arch/powerpc/include/asm/book3s/64/mmu.h +++ b/arch/powerpc/include/asm/book3s/64/mmu.h @@ -98,7 +98,9 @@ typedef struct { * from EA and new context ids to build the new VAs. 
*/ mm_context_id_t id; +#ifdef CONFIG_PPC_64S_HASH_MMU mm_context_id_t extended_id[TASK_SIZE_USER64/TASK_CONTEXT_SIZE]; +#endif }; /* Number of bits in the mm_cpumask */ @@ -110,7 +112,9 @@ typedef struct { /* Number of user space windows opened in process mm_context */ atomic_t vas_windows; +#ifdef CONFIG_PPC_64S_HASH_MMU struct hash_mm_context *hash_context; +#endif void __user *vdso; /* @@ -133,6 +137,7 @@ typedef struct { #endif } mm_context_t; +#ifdef CONFIG_PPC_64S_HASH_MMU static inline u16 mm_ctx_user_psize(mm_context_t *ctx) { return ctx->hash_context->user_psize; @@ -187,11 +192,22 @@ static inline struct subpage_prot_table *mm_ctx_subpage_prot(mm_context_t *ctx) } #endif +#endif + /* * The current system page and segment sizes */ -extern int mmu_linear_psize; +#if defined(CONFIG_PPC_RADIX_MMU) && !defined(CONFIG_PPC_64S_HASH_MMU) +#ifdef CONFIG_PPC_64K_PAGES +#define mmu_virtual_psize MMU_PAGE_64K +#else +#define mmu_virtual_psize MMU_PAGE_4K +
[RFC PATCH 3/6] powerpc/pseries: Stop selecting PPC_HASH_MMU_NATIVE
The pseries platform does not use the native hash code but the PAPR virtualised hash interfaces. Remove PPC_HASH_MMU_NATIVE. This requires moving some tlbiel code from hash_native.c to hash_utils.c. Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/book3s/64/tlbflush.h | 4 - arch/powerpc/mm/book3s64/hash_native.c| 104 -- arch/powerpc/mm/book3s64/hash_utils.c | 104 ++ arch/powerpc/platforms/pseries/Kconfig| 1 - 4 files changed, 104 insertions(+), 109 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h b/arch/powerpc/include/asm/book3s/64/tlbflush.h index 215973b4cb26..d2e80f178b6d 100644 --- a/arch/powerpc/include/asm/book3s/64/tlbflush.h +++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h @@ -14,7 +14,6 @@ enum { TLB_INVAL_SCOPE_LPID = 1, /* invalidate TLBs for current LPID */ }; -#ifdef CONFIG_PPC_NATIVE static inline void tlbiel_all(void) { /* @@ -30,9 +29,6 @@ static inline void tlbiel_all(void) else hash__tlbiel_all(TLB_INVAL_SCOPE_GLOBAL); } -#else -static inline void tlbiel_all(void) { BUG(); } -#endif static inline void tlbiel_all_lpid(bool radix) { diff --git a/arch/powerpc/mm/book3s64/hash_native.c b/arch/powerpc/mm/book3s64/hash_native.c index 52e170bd95ae..83afd08351f1 100644 --- a/arch/powerpc/mm/book3s64/hash_native.c +++ b/arch/powerpc/mm/book3s64/hash_native.c @@ -43,110 +43,6 @@ static DEFINE_RAW_SPINLOCK(native_tlbie_lock); -static inline void tlbiel_hash_set_isa206(unsigned int set, unsigned int is) -{ - unsigned long rb; - - rb = (set << PPC_BITLSHIFT(51)) | (is << PPC_BITLSHIFT(53)); - - asm volatile("tlbiel %0" : : "r" (rb)); -} - -/* - * tlbiel instruction for hash, set invalidation - * i.e., r=1 and is=01 or is=10 or is=11 - */ -static __always_inline void tlbiel_hash_set_isa300(unsigned int set, unsigned int is, - unsigned int pid, - unsigned int ric, unsigned int prs) -{ - unsigned long rb; - unsigned long rs; - unsigned int r = 0; /* hash format */ - - rb = (set << PPC_BITLSHIFT(51)) | (is << 
PPC_BITLSHIFT(53)); - rs = ((unsigned long)pid << PPC_BITLSHIFT(31)); - - asm volatile(PPC_TLBIEL(%0, %1, %2, %3, %4) -: : "r"(rb), "r"(rs), "i"(ric), "i"(prs), "i"(r) -: "memory"); -} - - -static void tlbiel_all_isa206(unsigned int num_sets, unsigned int is) -{ - unsigned int set; - - asm volatile("ptesync": : :"memory"); - - for (set = 0; set < num_sets; set++) - tlbiel_hash_set_isa206(set, is); - - ppc_after_tlbiel_barrier(); -} - -static void tlbiel_all_isa300(unsigned int num_sets, unsigned int is) -{ - unsigned int set; - - asm volatile("ptesync": : :"memory"); - - /* -* Flush the partition table cache if this is HV mode. -*/ - if (early_cpu_has_feature(CPU_FTR_HVMODE)) - tlbiel_hash_set_isa300(0, is, 0, 2, 0); - - /* -* Now invalidate the process table cache. UPRT=0 HPT modes (what -* current hardware implements) do not use the process table, but -* add the flushes anyway. -* -* From ISA v3.0B p. 1078: -* The following forms are invalid. -* * PRS=1, R=0, and RIC!=2 (The only process-scoped -*HPT caching is of the Process Table.) -*/ - tlbiel_hash_set_isa300(0, is, 0, 2, 1); - - /* -* Then flush the sets of the TLB proper. Hash mode uses -* partition scoped TLB translations, which may be flushed -* in !HV mode. 
-*/ - for (set = 0; set < num_sets; set++) - tlbiel_hash_set_isa300(set, is, 0, 0, 0); - - ppc_after_tlbiel_barrier(); - - asm volatile(PPC_ISA_3_0_INVALIDATE_ERAT "; isync" : : :"memory"); -} - -void hash__tlbiel_all(unsigned int action) -{ - unsigned int is; - - switch (action) { - case TLB_INVAL_SCOPE_GLOBAL: - is = 3; - break; - case TLB_INVAL_SCOPE_LPID: - is = 2; - break; - default: - BUG(); - } - - if (early_cpu_has_feature(CPU_FTR_ARCH_300)) - tlbiel_all_isa300(POWER9_TLB_SETS_HASH, is); - else if (early_cpu_has_feature(CPU_FTR_ARCH_207S)) - tlbiel_all_isa206(POWER8_TLB_SETS, is); - else if (early_cpu_has_feature(CPU_FTR_ARCH_206)) - tlbiel_all_isa206(POWER7_TLB_SETS, is); - else - WARN(1, "%s called on pre-POWER7 CPU\n", __func__); -} - static inline unsigned long ___tlbie(unsigned long vpn, int psize, int apsize, int ssize) { diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c index e35faefa8e59..43c05cdbba73 10064
[RFC PATCH 2/6] powerpc: Rename PPC_NATIVE to PPC_HASH_MMU_NATIVE
PPC_NATIVE now only controls the native HPT code, so rename it to be more descriptive. Restrict it to Book3S only. Signed-off-by: Nicholas Piggin --- arch/powerpc/mm/book3s64/Makefile | 2 +- arch/powerpc/mm/book3s64/hash_utils.c | 2 +- arch/powerpc/platforms/52xx/Kconfig| 2 +- arch/powerpc/platforms/Kconfig | 4 ++-- arch/powerpc/platforms/cell/Kconfig| 2 +- arch/powerpc/platforms/chrp/Kconfig| 2 +- arch/powerpc/platforms/embedded6xx/Kconfig | 2 +- arch/powerpc/platforms/maple/Kconfig | 2 +- arch/powerpc/platforms/microwatt/Kconfig | 2 +- arch/powerpc/platforms/pasemi/Kconfig | 2 +- arch/powerpc/platforms/powermac/Kconfig| 2 +- arch/powerpc/platforms/powernv/Kconfig | 2 +- arch/powerpc/platforms/pseries/Kconfig | 2 +- 13 files changed, 14 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/mm/book3s64/Makefile b/arch/powerpc/mm/book3s64/Makefile index 1b56d3af47d4..319f4b7f3357 100644 --- a/arch/powerpc/mm/book3s64/Makefile +++ b/arch/powerpc/mm/book3s64/Makefile @@ -6,7 +6,7 @@ CFLAGS_REMOVE_slb.o = $(CC_FLAGS_FTRACE) obj-y += hash_pgtable.o hash_utils.o slb.o \ mmu_context.o pgtable.o hash_tlb.o -obj-$(CONFIG_PPC_NATIVE) += hash_native.o +obj-$(CONFIG_PPC_HASH_MMU_NATIVE) += hash_native.o obj-$(CONFIG_PPC_RADIX_MMU)+= radix_pgtable.o radix_tlb.o obj-$(CONFIG_PPC_4K_PAGES) += hash_4k.o obj-$(CONFIG_PPC_64K_PAGES)+= hash_64k.o diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c index ac5720371c0d..e35faefa8e59 100644 --- a/arch/powerpc/mm/book3s64/hash_utils.c +++ b/arch/powerpc/mm/book3s64/hash_utils.c @@ -1091,7 +1091,7 @@ void __init hash__early_init_mmu(void) ps3_early_mm_init(); else if (firmware_has_feature(FW_FEATURE_LPAR)) hpte_init_pseries(); - else if (IS_ENABLED(CONFIG_PPC_NATIVE)) + else if (IS_ENABLED(CONFIG_PPC_HASH_MMU_NATIVE)) hpte_init_native(); if (!mmu_hash_ops.hpte_insert) diff --git a/arch/powerpc/platforms/52xx/Kconfig b/arch/powerpc/platforms/52xx/Kconfig index 99d60acc20c8..b72ed2950ca8 100644 --- 
a/arch/powerpc/platforms/52xx/Kconfig +++ b/arch/powerpc/platforms/52xx/Kconfig @@ -34,7 +34,7 @@ config PPC_EFIKA bool "bPlan Efika 5k2. MPC5200B based computer" depends on PPC_MPC52xx select PPC_RTAS - select PPC_NATIVE + select PPC_HASH_MMU_NATIVE config PPC_LITE5200 bool "Freescale Lite5200 Eval Board" diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig index e02d29a9d12f..d41dad227de8 100644 --- a/arch/powerpc/platforms/Kconfig +++ b/arch/powerpc/platforms/Kconfig @@ -40,9 +40,9 @@ config EPAPR_PARAVIRT In case of doubt, say Y -config PPC_NATIVE +config PPC_HASH_MMU_NATIVE bool - depends on PPC_BOOK3S_32 || PPC64 + depends on PPC_BOOK3S help Support for running natively on the hardware, i.e. without a hypervisor. This option is not user-selectable but should diff --git a/arch/powerpc/platforms/cell/Kconfig b/arch/powerpc/platforms/cell/Kconfig index cb70c5f25bc6..db4465c51b56 100644 --- a/arch/powerpc/platforms/cell/Kconfig +++ b/arch/powerpc/platforms/cell/Kconfig @@ -8,7 +8,7 @@ config PPC_CELL_COMMON select PPC_DCR_MMIO select PPC_INDIRECT_PIO select PPC_INDIRECT_MMIO - select PPC_NATIVE + select PPC_HASH_MMU_NATIVE select PPC_RTAS select IRQ_EDGE_EOI_HANDLER diff --git a/arch/powerpc/platforms/chrp/Kconfig b/arch/powerpc/platforms/chrp/Kconfig index 9b5c5505718a..ff30ed579a39 100644 --- a/arch/powerpc/platforms/chrp/Kconfig +++ b/arch/powerpc/platforms/chrp/Kconfig @@ -11,6 +11,6 @@ config PPC_CHRP select RTAS_ERROR_LOGGING select PPC_MPC106 select PPC_UDBG_16550 - select PPC_NATIVE + select PPC_HASH_MMU_NATIVE select FORCE_PCI default y diff --git a/arch/powerpc/platforms/embedded6xx/Kconfig b/arch/powerpc/platforms/embedded6xx/Kconfig index 4c6d703a4284..c54786f8461e 100644 --- a/arch/powerpc/platforms/embedded6xx/Kconfig +++ b/arch/powerpc/platforms/embedded6xx/Kconfig @@ -55,7 +55,7 @@ config MVME5100 select FORCE_PCI select PPC_INDIRECT_PCI select PPC_I8259 - select PPC_NATIVE + select PPC_HASH_MMU_NATIVE select PPC_UDBG_16550 
help This option enables support for the Motorola (now Emerson) MVME5100 diff --git a/arch/powerpc/platforms/maple/Kconfig b/arch/powerpc/platforms/maple/Kconfig index 86ae210bee9a..7fd84311ade5 100644 --- a/arch/powerpc/platforms/maple/Kconfig +++ b/arch/powerpc/platforms/maple/Kconfig @@ -9,7 +9,7 @@ config PPC_MAPLE select GENERIC_TBSYNC select PPC_UDBG_16550 select PPC_970_NAP - select PPC_NATIVE + select PPC_HASH_
[RFC PATCH 1/6] powerpc: Remove unused FW_FEATURE_NATIVE references
FW_FEATURE_NATIVE_ALWAYS and FW_FEATURE_NATIVE_POSSIBLE are always
zero and never do anything. Remove them.

Signed-off-by: Nicholas Piggin
---
 arch/powerpc/include/asm/firmware.h | 8
 1 file changed, 8 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
index 7604673787d6..df9e4aae917e 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -79,8 +79,6 @@ enum {
 	FW_FEATURE_POWERNV_ALWAYS = 0,
 	FW_FEATURE_PS3_POSSIBLE = FW_FEATURE_LPAR | FW_FEATURE_PS3_LV1,
 	FW_FEATURE_PS3_ALWAYS = FW_FEATURE_LPAR | FW_FEATURE_PS3_LV1,
-	FW_FEATURE_NATIVE_POSSIBLE = 0,
-	FW_FEATURE_NATIVE_ALWAYS = 0,
 	FW_FEATURE_POSSIBLE =
 #ifdef CONFIG_PPC_PSERIES
 		FW_FEATURE_PSERIES_POSSIBLE |
@@ -90,9 +88,6 @@ enum {
 #endif
 #ifdef CONFIG_PPC_PS3
 		FW_FEATURE_PS3_POSSIBLE |
-#endif
-#ifdef CONFIG_PPC_NATIVE
-		FW_FEATURE_NATIVE_ALWAYS |
 #endif
 		0,
 	FW_FEATURE_ALWAYS =
@@ -104,9 +99,6 @@ enum {
 #endif
 #ifdef CONFIG_PPC_PS3
 		FW_FEATURE_PS3_ALWAYS &
-#endif
-#ifdef CONFIG_PPC_NATIVE
-		FW_FEATURE_NATIVE_ALWAYS &
 #endif
 		FW_FEATURE_POSSIBLE,
-- 
2.23.0
[RFC PATCH 0/6] powerpc: Make hash MMU code build configurable
Now that there's a platform that can make good use of it, here's a series
that can prevent the hash MMU code being built for 64s platforms that
don't need it.

Thanks,
Nick

Nicholas Piggin (6):
  powerpc: Remove unused FW_FEATURE_NATIVE references
  powerpc: Rename PPC_NATIVE to PPC_HASH_MMU_NATIVE
  powerpc/pseries: Stop selecting PPC_HASH_MMU_NATIVE
  powerpc/64s: Make hash MMU code build configurable
  powerpc/microwatt: select POWER9_CPU
  powerpc/microwatt: Stop building the hash MMU code

 arch/powerpc/Kconfig                          |   1 +
 arch/powerpc/configs/microwatt_defconfig      |   2 +-
 arch/powerpc/include/asm/book3s/64/mmu.h      |  22 +++-
 .../include/asm/book3s/64/tlbflush-hash.h     |   7 ++
 arch/powerpc/include/asm/book3s/64/tlbflush.h |   4 -
 arch/powerpc/include/asm/book3s/pgtable.h     |   4 +
 arch/powerpc/include/asm/firmware.h           |   8 --
 arch/powerpc/include/asm/mmu.h                |  38 +-
 arch/powerpc/include/asm/mmu_context.h        |   2 +
 arch/powerpc/include/asm/paca.h               |   8 ++
 arch/powerpc/kernel/asm-offsets.c             |   2 +
 arch/powerpc/kernel/dt_cpu_ftrs.c             |  10 +-
 arch/powerpc/kernel/entry_64.S                |   4 +-
 arch/powerpc/kernel/exceptions-64s.S          |  16 +++
 arch/powerpc/kernel/mce.c                     |   2 +-
 arch/powerpc/kernel/mce_power.c               |  10 +-
 arch/powerpc/kernel/paca.c                    |  18 ++-
 arch/powerpc/kernel/process.c                 |  13 +-
 arch/powerpc/kernel/prom.c                    |   2 +
 arch/powerpc/kernel/setup_64.c                |   4 +
 arch/powerpc/kexec/core_64.c                  |   4 +-
 arch/powerpc/kexec/ranges.c                   |   4 +
 arch/powerpc/kvm/Kconfig                      |   1 +
 arch/powerpc/mm/book3s64/Makefile             |  19 +--
 arch/powerpc/mm/book3s64/hash_native.c        | 104
 arch/powerpc/mm/book3s64/hash_pgtable.c       |   1 -
 arch/powerpc/mm/book3s64/hash_utils.c         | 116 --
 .../{hash_hugetlbpage.c => hugetlbpage.c}     |   6 +
 arch/powerpc/mm/book3s64/mmu_context.c        |  16 +++
 arch/powerpc/mm/book3s64/pgtable.c            |  22 +++-
 arch/powerpc/mm/book3s64/radix_pgtable.c      |   4 +
 arch/powerpc/mm/book3s64/slb.c                |  16 ---
 arch/powerpc/mm/copro_fault.c                 |   2 +
 arch/powerpc/mm/fault.c                       |  17 +++
 arch/powerpc/mm/pgtable.c                     |  10 +-
 arch/powerpc/platforms/52xx/Kconfig           |   2 +-
 arch/powerpc/platforms/Kconfig                |   4 +-
 arch/powerpc/platforms/Kconfig.cputype        |  21 +++-
 arch/powerpc/platforms/cell/Kconfig           |   3 +-
 arch/powerpc/platforms/chrp/Kconfig           |   2 +-
 arch/powerpc/platforms/embedded6xx/Kconfig    |   2 +-
 arch/powerpc/platforms/maple/Kconfig          |   3 +-
 arch/powerpc/platforms/microwatt/Kconfig      |   2 +-
 arch/powerpc/platforms/pasemi/Kconfig         |   3 +-
 arch/powerpc/platforms/powermac/Kconfig       |   3 +-
 arch/powerpc/platforms/powernv/Kconfig        |   2 +-
 arch/powerpc/platforms/powernv/idle.c         |   2 +
 arch/powerpc/platforms/powernv/setup.c        |   2 +
 arch/powerpc/platforms/pseries/Kconfig        |   1 -
 arch/powerpc/platforms/pseries/lpar.c         |  68 +-
 arch/powerpc/platforms/pseries/lparcfg.c      |   2 +-
 arch/powerpc/platforms/pseries/mobility.c     |   6 +
 arch/powerpc/platforms/pseries/ras.c          |   4 +
 arch/powerpc/platforms/pseries/reconfig.c     |   2 +
 arch/powerpc/platforms/pseries/setup.c        |   6 +-
 arch/powerpc/xmon/xmon.c                      |   8 +-
 56 files changed, 427 insertions(+), 240 deletions(-)
 rename arch/powerpc/mm/book3s64/{hash_hugetlbpage.c => hugetlbpage.c} (95%)

-- 
2.23.0
Re: [RFC PATCH] powerpc: Investigate static_call concept
On Fri, Aug 27, 2021 at 04:18:47PM +0200, Peter Zijlstra wrote: > On Fri, Aug 27, 2021 at 09:45:37AM +, Christophe Leroy wrote: > > This RFC is to validate the concept of static_call on powerpc. > > > > Highly copied from x86. > > > > It replaces ppc_md.get_irq() which is called at every IRQ, by > > a static call. > > The code looks saner, but does it actually improve performance? I'm > thinking the double branch also isn't free. It isn't, but it is very cheap, while the branch-to-count is not, even *if* it is correctly predicted. > The paranoid in me would've made it: > > BUG_ON(patch_branch(...)); > > just to make sure to notice the target not fitting. Ohh, patch_branch() > doesn't return the create_branch() error, perhaps that wants to be > fixed? Should that be allowed to fail ever? I.e., should a failure be a fatal error? Sounds very fragile otherwise. Segher
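The worry about patch_branch() silently swallowing the create_branch()
error boils down to a reachability check: a PowerPC I-form branch encodes a
26-bit signed, 4-byte-aligned byte offset, so a direct `b target` can only
reach roughly +/- 32 MB from the branch site. A hedged userspace sketch of
that check (the function name and error-handling convention are invented
for illustration, not the kernel's):

```c
#include <errno.h>
#include <stdint.h>

/* Can a direct I-form branch at `from` reach `to`?  The LI field is 24
 * bits, shifted left 2, sign-extended: offsets in [-0x2000000, 0x1fffffc]
 * and a multiple of 4.  Returns 0 if reachable, -ERANGE otherwise --
 * a caller like patch_branch() should propagate this, not ignore it. */
static int check_branch_range(uintptr_t from, uintptr_t to)
{
	intptr_t offset = (intptr_t)(to - from);

	if (offset < -0x2000000 || offset > 0x1fffffc || (offset & 3))
		return -ERANGE;
	return 0;
}
```

If a static-call trampoline target falls outside this window, the patching
step has to fail loudly (or fall back to an indirect sequence) rather than
emit a branch that cannot encode the offset, which is the fragility Segher
is pointing at.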
Re: [PATCH v4 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance
On Fri, 27 Aug 2021 at 16:50, Peter Zijlstra wrote: > > On Fri, Aug 27, 2021 at 12:13:42PM +0200, Vincent Guittot wrote: > > > +/** > > > + * asym_smt_can_pull_tasks - Check whether the load balancing CPU can > > > pull tasks > > > + * @dst_cpu: Destination CPU of the load balancing > > > + * @sds: Load-balancing data with statistics of the local group > > > + * @sgs: Load-balancing statistics of the candidate busiest group > > > + * @sg:The candidate busiet group > > > + * > > > + * Check the state of the SMT siblings of both @sds::local and @sg and > > > decide > > > + * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, > > > it can > > > + * pull tasks if two or more of the SMT siblings of @sg are busy. If > > > only one > > > + * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority. > > > + * > > > + * If both @dst_cpu and @sg have SMT siblings, even the number of idle > > > CPUs > > > + * between @sds::local and @sg. Thus, pull tasks from @sg if the > > > difference > > > + * between the number of busy CPUs is 2 or more. If the difference is of > > > 1, > > > + * only pull if @dst_cpu has higher priority. If @sg does not have SMT > > > siblings > > > + * only pull tasks if all of the SMT siblings of @dst_cpu are idle and > > > @sg > > > + * has lower priority. > > > + */ > > > +static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds, > > > + struct sg_lb_stats *sgs, > > > + struct sched_group *sg) > > > +{ > > > +#ifdef CONFIG_SCHED_SMT > > > + bool local_is_smt, sg_is_smt; > > > + int sg_busy_cpus; > > > + > > > + local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY; > > > + sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY; > > > + > > > + sg_busy_cpus = sgs->group_weight - sgs->idle_cpus; > > > + > > > + if (!local_is_smt) { > > > + /* > > > +* If we are here, @dst_cpu is idle and does not have SMT > > > +* siblings. Pull tasks if candidate group has two or more > > > +* busy CPUs. 
> > > +*/ > > > + if (sg_is_smt && sg_busy_cpus >= 2) > > > + return true; > > > + > > > + /* > > > +* @dst_cpu does not have SMT siblings. @sg may have SMT > > > +* siblings and only one is busy. In such case, @dst_cpu > > > +* can help if it has higher priority and is idle. > > > +*/ > > > + return !sds->local_stat.group_util && > > > > sds->local_stat.group_util can't be used to decide if a CPU or group > > of CPUs is idle. util_avg is usually not null when a CPU becomes idle > > and you can have to wait more than 300ms before it becomes Null > > At the opposite, the utilization of a CPU can be null but a task with > > null utilization has just woken up on it. > > Utilization is used to reflect the average work of the CPU or group of > > CPUs but not the current state > > If you want immediate idle, sgs->nr_running == 0 or sgs->idle_cpus == > sgs->group_weight come to mind. yes, I have the same in mind > > > > + sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu); > > > + } > > > + > > > + /* @dst_cpu has SMT siblings. */ > > > + > > > + if (sg_is_smt) { > > > + int local_busy_cpus = sds->local->group_weight - > > > + sds->local_stat.idle_cpus; > > > + int busy_cpus_delta = sg_busy_cpus - local_busy_cpus; > > > + > > > + /* Local can always help to even the number busy CPUs. */ > > > > default behavior of the load balance already tries to even the number > > of idle CPUs. > > Right, but I suppose this is because we're trapped here and have to deal > > with the SMT-SMT case too. Ricardo, can you clarify? IIUC, this function is used in sg_lb_stats to set sgs->group_asym_packing which is then used to set the group state to group_asym_packing and force asym migration.
But if we only want to even the number of busy CPUs between the group, we should not need to set the group state to group_asym_packing > > > > + if (busy_cpus_delta >= 2) > > > + return true; > > > + > > > + if (busy_cpus_delta == 1) > > > + return sched_asym_prefer(dst_cpu, > > > +sg->asym_prefer_cpu); > > > + > > > + return false; > > > + } > > > + > > > + /* > > > +* @sg does not have SMT siblings. Ensure that @sds::local does > > > not end > > > +* up with more than one busy SMT sibling and only pull tasks if > > > there > > > +* are not busy CPUs. As CPUs move in and out of idle state > > > frequently, > > > +* also check the group utilization to smoother the decision.
Re: [PATCH v4 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance
On Fri, Aug 27, 2021 at 12:13:42PM +0200, Vincent Guittot wrote: > > +/** > > + * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull > > tasks > > + * @dst_cpu: Destination CPU of the load balancing > > + * @sds: Load-balancing data with statistics of the local group > > + * @sgs: Load-balancing statistics of the candidate busiest group > > + * @sg:The candidate busiet group > > + * > > + * Check the state of the SMT siblings of both @sds::local and @sg and > > decide > > + * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, it > > can > > + * pull tasks if two or more of the SMT siblings of @sg are busy. If only > > one > > + * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority. > > + * > > + * If both @dst_cpu and @sg have SMT siblings, even the number of idle CPUs > > + * between @sds::local and @sg. Thus, pull tasks from @sg if the difference > > + * between the number of busy CPUs is 2 or more. If the difference is of 1, > > + * only pull if @dst_cpu has higher priority. If @sg does not have SMT > > siblings > > + * only pull tasks if all of the SMT siblings of @dst_cpu are idle and @sg > > + * has lower priority. > > + */ > > +static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds, > > + struct sg_lb_stats *sgs, > > + struct sched_group *sg) > > +{ > > +#ifdef CONFIG_SCHED_SMT > > + bool local_is_smt, sg_is_smt; > > + int sg_busy_cpus; > > + > > + local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY; > > + sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY; > > + > > + sg_busy_cpus = sgs->group_weight - sgs->idle_cpus; > > + > > + if (!local_is_smt) { > > + /* > > +* If we are here, @dst_cpu is idle and does not have SMT > > +* siblings. Pull tasks if candidate group has two or more > > +* busy CPUs. > > +*/ > > + if (sg_is_smt && sg_busy_cpus >= 2) > > + return true; > > + > > + /* > > +* @dst_cpu does not have SMT siblings. @sg may have SMT > > +* siblings and only one is busy. 
In such case, @dst_cpu > > +* can help if it has higher priority and is idle. > > +*/ > > + return !sds->local_stat.group_util && > > sds->local_stat.group_util can't be used to decide if a CPU or group > of CPUs is idle. util_avg is usually not null when a CPU becomes idle > and you can have to wait more than 300ms before it becomes Null > At the opposite, the utilization of a CPU can be null but a task with > null utilization has just woken up on it. > Utilization is used to reflect the average work of the CPU or group of > CPUs but not the current state If you want immediate idle, sgs->nr_running == 0 or sgs->idle_cpus == sgs->group_weight come to mind. > > + sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu); > > + } > > + > > + /* @dst_cpu has SMT siblings. */ > > + > > + if (sg_is_smt) { > > + int local_busy_cpus = sds->local->group_weight - > > + sds->local_stat.idle_cpus; > > + int busy_cpus_delta = sg_busy_cpus - local_busy_cpus; > > + > > + /* Local can always help to even the number busy CPUs. */ > > default behavior of the load balance already tries to even the number > of idle CPUs. Right, but I suppose this is because we're trapped here and have to deal with the SMT-SMT case too. Ricardo, can you clarify? > > + if (busy_cpus_delta >= 2) > > + return true; > > + > > + if (busy_cpus_delta == 1) > > + return sched_asym_prefer(dst_cpu, > > +sg->asym_prefer_cpu); > > + > > + return false; > > + } > > + > > + /* > > +* @sg does not have SMT siblings. Ensure that @sds::local does not > > end > > +* up with more than one busy SMT sibling and only pull tasks if > > there > > +* are not busy CPUs. As CPUs move in and out of idle state > > frequently, > > +* also check the group utilization to smoother the decision. 
> > +*/ > > + if (!sds->local_stat.group_util) > > same comment as above about the meaning of group_util == 0 > > > + return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu); > > + > > + return false; > > +#else > > + /* Always return false so that callers deal with non-SMT cases. */ > > + return false; > > +#endif > > +} > > + > > static inline bool > > sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct > > sg_lb_stats *sgs, > >struct sched_group *group) > > { > > + /* Only do SMT checks if either local or candidate have SMT
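The decision table being reviewed in this thread can be summarized in a
small userspace model. Everything here is a toy invented for illustration:
struct toy_group is not the kernel's sg_lb_stats, `prefer()` assumes the
hypothetical convention "lower number = higher priority" (the kernel's
sched_asym_prefer() works on arch-defined asym priorities), and the
"group_util == 0" test is replaced by the reviewers' suggested
"idle_cpus == group_weight" check for a fully idle group.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_group {
	bool is_smt;	/* group is the SMT siblings of one core */
	int weight;	/* number of CPUs in the group */
	int idle_cpus;
	int prio;	/* priority of the group's preferred CPU */
};

/* Toy priority comparison: does a beat b?  (lower number wins here) */
static bool prefer(int a_prio, int b_prio)
{
	return a_prio < b_prio;
}

/* Should the idle dst CPU pull from the candidate busiest group? */
static bool can_pull(const struct toy_group *local,
		     const struct toy_group *busiest, int dst_prio)
{
	int sg_busy = busiest->weight - busiest->idle_cpus;

	if (!local->is_smt) {
		/* Lone idle dst CPU: always worth relieving a core with
		 * two or more busy siblings... */
		if (busiest->is_smt && sg_busy >= 2)
			return true;
		/* ...otherwise only pull if dst has higher priority and
		 * the local group is genuinely idle. */
		return local->idle_cpus == local->weight &&
		       prefer(dst_prio, busiest->prio);
	}

	if (busiest->is_smt) {
		int local_busy = local->weight - local->idle_cpus;
		int delta = sg_busy - local_busy;

		if (delta >= 2)
			return true;	/* pulling evens the busy count */
		if (delta == 1)
			return prefer(dst_prio, busiest->prio);
		return false;
	}

	/* Busiest is non-SMT: only pull into an entirely idle core
	 * with higher priority. */
	return local->idle_cpus == local->weight &&
	       prefer(dst_prio, busiest->prio);
}
```

Vincent's open question is visible in the SMT-vs-SMT branch: the
`delta >= 2` case merely evens busy-CPU counts, which the default balancer
already does, so it is unclear it needs to be flagged as asym packing.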
Re: [PATCH] ppc: add "-z notext" flag to disable diagnostic
Hi! On Sat, Aug 14, 2021 at 12:34:15PM -0700, Fāng-ruì Sòng wrote: > On Sat, Aug 14, 2021 at 5:59 AM Segher Boessenkool > wrote: > > > > On Fri, Aug 13, 2021 at 01:05:08PM -0700, Fangrui Song wrote: > > > Text relocations are considered very awful by linker developers. > > > > By very few linker developers. > https://groups.google.com/g/generic-abi/c/Ckq19PfLxyk/m/uW29sgkoAgAJ > Ali Bahrami: "My opinion is that no one wants text relocations, but > not labeling them if they do exist doesn't seem right. I find the > presence of DF_TEXTREL very interesting when diagnosing various > issues." I don't know who that is, and that post has no context. > https://gcc.gnu.org/legacy-ml/gcc/2016-04/msg00138.html > ( "So why not simply create 'dynamic text relocations' then? Is that > possible with a pure linker change?" ) > Cary Coutant: "Ugh. Besides being a bad idea from a performance point > of view, it's not even always possible to do. Depending on the > architecture, a direct reference from an executable to a variable in a > shared library may not have the necessary reach." That is about a very specific kind of relocation. > binutils-gdb commit "Add linker option: --warn-shared-textrel to > produce warnings when adding a DT_TEXTREL to a shared object." > Nick Clifton That does not say text relocations are bad. Of course you want to know if they are there, for various reasons, like, if they are disallowed on some systems. > https://www.openwall.com/lists/musl/2020/09/26/3 > Szabolcs Nagy: "nice. and gcc passes -z text for static pie code so > that case should not end up with text rels." That does not say text relocations are bad. > Someone wrote "Overall, the overhead of processing text relocations > can cause serious performance degradation." in Solaris' Linker and > Libraries Guide. In process startup, sure. And it can make those processes run faster as well. That is the tradeoff with *all* relocations; you can make any code without any relocations. 
Relocations are a tradeoff, like most things. > > How would this be a benefit to security? > > https://flameeyes.blog/2016/01/16/textrels-text-relocations-and-their-impact-on-hardening-techniques/ This means that those "hardening techniques" have some serious weaknesses, that is all. And hardening is not part of security anyway; it is impact mitigation. > FWIW I contributed a glibc patch allowing TEXTREL to co-exist with ifunc. > It requires temporary mapping the text segment W^X. What does W^X mean here? It normally means no mapping is both writable and executable at the same time. > > > There are no text relocations, therefore no need for -z notext. > > > > This is a choice by the compiler, nothing more. It saves some process > > startup time, and allows slightly more maps to be shared by processes > > that run the same images. But it is a tradeoff, so it might change; and > > of course it is not an ABI requirement. > Text relocations are generally awful. Great arguments, thanks! :-P > GNU ld and gold's traditional "add DF_TEXTREL on-demand" behavior made > such user errors easy to make. That has no bearing on if text relocations are useful or not. > I understand that kernels are special applications where we apply > relocations once and many user-space objection can be less of a > concern/ignored. > However, the Linux kernel is already in a position where many linker > options are controlled and thus should specify -z notext to make > the intention explicit, or fix the problems (I think x86-64 is good; > that said, powerpc > has a higher cost using PC-relative instructions so pay the oneshot relocation > time cost probably isn't a bad choice) I have no idea what you mean. Segher
Re: [RFC PATCH] powerpc: Investigate static_call concept
On Fri, Aug 27, 2021 at 09:45:37AM +, Christophe Leroy wrote: > This RFC is to validate the concept of static_call on powerpc. > > Highly copied from x86. > > It replaces ppc_md.get_irq() which is called at every IRQ, by > a static call. The code looks saner, but does it actually improve performance? I'm thinking the double branch also isn't free. > When updating the call, we just replace the instruction at the > trampoline address by a relative jump to the function. > > For the time being, the case of out-of-range functions is not handled. The paranoid in me would've made it: BUG_ON(patch_branch(...)); just to make sure to notice the target not fitting. Ohh, patch_branch() doesn't return the create_branch() error, perhaps that wants to be fixed? Did you see the arm64 variant that deals with out-of-range functions in their trampoline? https://lore.kernel.org/linux-arm-kernel/20201120082103.4840-1-a...@kernel.org/ Not exactly 'nice' but supposedly that works. > +#define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name) \ > + asm(".pushsection .text, \"ax\" \n" \ > + ".align 4 \n" \ > + ".globl " STATIC_CALL_TRAMP_STR(name) " \n" \ > + STATIC_CALL_TRAMP_STR(name) ": \n" \ > + " blr \n" \ > + ".type " STATIC_CALL_TRAMP_STR(name) ", @function \n" \ > + ".size " STATIC_CALL_TRAMP_STR(name) ", . - " > STATIC_CALL_TRAMP_STR(name) " \n" \ > + ".popsection\n") > + Since you support CALL_NULL_TRAMP, your patch function below: > +void arch_static_call_transform(void *site, void *tramp, void *func, bool > tail) > +{ > + mutex_lock(&text_mutex); > + > + if (tramp) > + patch_branch(tramp, (unsigned long)func, 0); > + > + mutex_unlock(&text_mutex); > +} > +EXPORT_SYMBOL_GPL(arch_static_call_transform); Ought to patch in "blr" when !func to be consistent :-) I'm thinking that your core kernel text all fits in the native range and only modules need out-of-range ?
Re: [PATCH v2 0/2] Kconfig symbol fixes on powerpc
On Thu, 19 Aug 2021 13:39:52 +0200, Lukas Bulwahn wrote: > The script ./scripts/checkkconfigsymbols.py warns on invalid references to > Kconfig symbols (often, minor typos, name confusions or outdated references). > > This patch series addresses all issues reported by > ./scripts/checkkconfigsymbols.py in ./drivers/usb/ for Kconfig and Makefile > files. Issues in the Kconfig and Makefile files indicate some shortcomings in > the overall build definitions, and often are true actionable issues to > address. > > [...] Patch 2 applied to powerpc/fixes. [2/2] powerpc: rectify selection to ARCH_ENABLE_SPLIT_PMD_PTLOCK https://git.kernel.org/powerpc/c/310d2e83cb9b7f1e7232319880e3fcb57592fa10 cheers
Re: [PATCH v1] powerpc/64s: Fix scv implicit soft-mask table for relocated kernels
On Fri, 20 Aug 2021 20:34:31 +1000, Nicholas Piggin wrote: > The implict soft-mask table addresses get relocated if they use a > relative symbol like a label. This is right for code that runs relocated > but not for unrelocated. The scv interrupt vectors run unrelocated, so > absolute addresses are required for their soft-mask table entry. > > This fixes crashing with relocated kernels, usually an asynchronous > interrupt hitting in the scv handler, then hitting the trap that checks > whether r1 is in userspace. > > [...] Applied to powerpc/fixes. [1/1] powerpc/64s: Fix scv implicit soft-mask table for relocated kernels https://git.kernel.org/powerpc/c/787c70f2f9990b5a197320152d2fc32cd8a6ad1a cheers
Re: [PATCH] powerpc/xive: Do not mark xive_request_ipi() as __init
On Mon, 16 Aug 2021 11:57:11 -0700, Nathan Chancellor wrote: > Compiling ppc64le_defconfig with clang-14 shows a modpost warning: > > WARNING: modpost: vmlinux.o(.text+0xa74e0): Section mismatch in > reference from the function xive_setup_cpu_ipi() to the function > .init.text:xive_request_ipi() > The function xive_setup_cpu_ipi() references > the function __init xive_request_ipi(). > This is often because xive_setup_cpu_ipi lacks a __init > annotation or the annotation of xive_request_ipi is wrong. > > [...] Applied to powerpc/fixes. [1/1] powerpc/xive: Do not mark xive_request_ipi() as __init https://git.kernel.org/powerpc/c/3f78c90f9eb2e228f44ecc8f4377753f0e11dbab cheers
Re: [PATCH v2] powerpc/mm: Fix set_memory_*() against concurrent accesses
On Wed, 18 Aug 2021 22:05:18 +1000, Michael Ellerman wrote: > Laurent reported that STRICT_MODULE_RWX was causing intermittent crashes > on one of his systems: > > kernel tried to execute exec-protected page (c00804073278) - exploit > attempt? (uid: 0) > BUG: Unable to handle kernel instruction fetch > Faulting instruction address: 0xc00804073278 > Oops: Kernel access of bad area, sig: 11 [#1] > LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries > Modules linked in: drm virtio_console fuse drm_panel_orientation_quirks ... > CPU: 3 PID: 44 Comm: kworker/3:1 Not tainted 5.14.0-rc4+ #12 > Workqueue: events control_work_handler [virtio_console] > NIP: c00804073278 LR: c00804073278 CTR: c01e9de0 > REGS: c0002e4ef7e0 TRAP: 0400 Not tainted (5.14.0-rc4+) > MSR: 80004280b033 CR: 24002822 XER: > 200400cf > ... > NIP fill_queue+0xf0/0x210 [virtio_console] > LR fill_queue+0xf0/0x210 [virtio_console] > Call Trace: > fill_queue+0xb4/0x210 [virtio_console] (unreliable) > add_port+0x1a8/0x470 [virtio_console] > control_work_handler+0xbc/0x1e8 [virtio_console] > process_one_work+0x290/0x590 > worker_thread+0x88/0x620 > kthread+0x194/0x1a0 > ret_from_kernel_thread+0x5c/0x64 > > [...] Applied to powerpc/fixes. [1/1] powerpc/mm: Fix set_memory_*() against concurrent accesses https://git.kernel.org/powerpc/c/9f7853d7609d59172eecfc5e7ccf503bc1b690bd cheers
Re: [PATCH v2] powerpc/32s: Fix random crashes by adding isync() after locking/unlocking KUEP
On Wed, 18 Aug 2021 06:49:29 + (UTC), Christophe Leroy wrote: > Commit b5efec00b671 ("powerpc/32s: Move KUEP locking/unlocking in C") > removed the 'isync' instruction after adding/removing NX bit in user > segments. The reasoning behind this change was that when setting the > NX bit we don't mind it taking effect with delay as the kernel never > executes text from userspace, and when clearing the NX bit this is > to return to userspace and then the 'rfi' should synchronise the > context. > > [...] Applied to powerpc/fixes. [1/1] powerpc/32s: Fix random crashes by adding isync() after locking/unlocking KUEP https://git.kernel.org/powerpc/c/ef486bf448a057a6e2d50e40ae879f7add6585da cheers
Re: [PATCH] powerpc/pseries: Fix build error when NUMA=n
Le 27/08/2021 à 15:15, Michael Ellerman a écrit : On Mon, 16 Aug 2021 14:10:32 +1000, Michael Ellerman wrote: As reported by lkp, if NUMA=n we see a build error: arch/powerpc/platforms/pseries/hotplug-cpu.c: In function 'pseries_cpu_hotplug_init': arch/powerpc/platforms/pseries/hotplug-cpu.c:1022:8: error: 'node_to_cpumask_map' undeclared 1022 |node_to_cpumask_map[node]); Use cpumask_of_node() which has an empty stub for NUMA=n, and when NUMA=y does a lookup from node_to_cpumask_map[]. [...] Applied to powerpc/next. [1/1] powerpc/pseries: Fix build error when NUMA=n https://git.kernel.org/powerpc/c/8b893ef190b0c440877de04f767efca4bf4d6af8 cheers Thanks, Michael, for fixing my bugs !
Re: [PATCH -next] selftests/powerpc: Remove duplicated include from tm-poison.c
On Fri, 26 Mar 2021 14:48:08 +0800, Zheng Yongjun wrote: > Remove duplicated include. > > > > Applied to powerpc/next. [1/1] selftests/powerpc: Remove duplicated include from tm-poison.c https://git.kernel.org/powerpc/c/6af0b5570b59ce8dd1608a8e48f59eff3f4bdd04 cheers
Re: [PATCH] [v2] arch: powerpc: Remove duplicate includes
On Tue, 23 Mar 2021 14:29:05 +0800, Wan Jiabing wrote: > mmu-hash.h: asm/bug.h has been included at line 12, so remove > the duplicate one at line 21. > interrupt.c: asm/interrupt.h has been included at line 12, so > remove the duplicate one at line 10. > time.c: linux/sched/clock.h has been included at line 33,so > remove the duplicate one at line 56 and move sched/cputime.h > under sched including segament. > > [...] Applied to powerpc/next. [1/1] arch: powerpc: Remove duplicate includes https://git.kernel.org/powerpc/c/e225c4d6bc389701f2f63fc246420a1da3465ab5 cheers
Re: [PATCH v2 0/4] Some improvements on regs usage
On Sat, 7 Aug 2021 09:02:35 +0800, sxwj...@me.com wrote: > From: Xiongwei Song > > When CONFIG_4xx=y or CONFIG_BOOKE=y, currently in code we reference dsisr > to get interrupt reasons and reference dar to get excepiton address. > However, in reference manuals, esr is used for interrupt reasons and dear > is used for excepiton address, so the patchset changes dsisr -> esr, > dar -> dear for CONFIG_4xx=y or CONFIG_BOOKE=y. > > [...] Applied to powerpc/next. [1/4] powerpc: Optimize register usage for esr register https://git.kernel.org/powerpc/c/4f8e78c0757e3c5a65d9d8ac76e2434c71a78f5a [2/4] powerpc/64e: Get esr offset with _ESR macro https://git.kernel.org/powerpc/c/cfa47772ca8d53d7a6c9b331a7f6e7c4c9827214 [3/4] powerpc: Optimize register usage for dear register https://git.kernel.org/powerpc/c/4872cbd0ca35ca5b20d52e2539e7e1950f126e7b [4/4] powerpc/64e: Get dear offset with _DEAR macro https://git.kernel.org/powerpc/c/d9db6e420268b2d561731468a31f0b15e2e9a145 cheers
Re: [PATCH] powerpc/head_check: use stdout for error messages
On Sun, 15 Aug 2021 15:23:34 -0700, Randy Dunlap wrote: > Prefer stderr instead of stdout for error messages. > This is a good practice and can help CI error detecting and > reporting (0day in this case). > > > > > [...] Applied to powerpc/next. [1/1] powerpc/head_check: use stdout for error messages https://git.kernel.org/powerpc/c/47c258d71ebfc832a760a1dc6540cf3c33968023 cheers
Re: (subset) [PATCH v2 00/60] KVM: PPC: Book3S HV P9: entry/exit optimisations
On Thu, 12 Aug 2021 02:00:34 +1000, Nicholas Piggin wrote: > This reduces radix guest full entry/exit latency on POWER9 and POWER10 > by 2x. > > Nested HV guests should see smaller improvements in their L1 entry/exit, > but this is also combined with most L0 speedups also applying to nested > entry. nginx localhost throughput test in a SMP nested guest is improved > about 10% (in a direct guest it doesn't change much because it uses XIVE > for IPIs) when L0 and L1 are patched. > > [...] Applied to powerpc/next. [01/60] KVM: PPC: Book3S HV: Initialise vcpu MSR with MSR_ME https://git.kernel.org/powerpc/c/fd42b7b09c602c904452c0c3e5955ca21d8e387a [02/60] KVM: PPC: Book3S HV: Remove TM emulation from POWER7/8 path https://git.kernel.org/powerpc/c/daac40e8d7a63ab8608132a7cfce411986feac32 [03/60] KVM: PPC: Book3S HV P9: Fixes for TM softpatch interrupt NIP https://git.kernel.org/powerpc/c/4782e0cd0d184d727ad3b0cfe20d1d44d9f98239 [04/60] KVM: PPC: Book3S HV Nested: Fix TM softpatch HFAC interrupt emulation https://git.kernel.org/powerpc/c/d82b392d9b3556b63e3f9916cf057ea847e173a9 [05/60] KVM: PPC: Book3S HV Nested: Sanitise vcpu registers https://git.kernel.org/powerpc/c/7487cabc7ed2f7716bf304e4e97c57fd995cf70e [06/60] KVM: PPC: Book3S HV Nested: Make nested HFSCR state accessible https://git.kernel.org/powerpc/c/8b210a880b35ba75eb42b79dfd65e369c1feb119 [07/60] KVM: PPC: Book3S HV Nested: Stop forwarding all HFUs to L1 https://git.kernel.org/powerpc/c/7c3ded5735141ff4d049747c9f76672a8b737c49 [08/60] KVM: PPC: Book3S HV Nested: save_hv_return_state does not require trap argument https://git.kernel.org/powerpc/c/f2e29db156523bf08a0524e0f4306a641912c6d9 [09/60] KVM: PPC: Book3S HV Nested: Reflect guest PMU in-use to L0 when guest SPRs are live https://git.kernel.org/powerpc/c/1782663897945a5cf28e564ba5eed730098e9aa4 [10/60] powerpc/64s: Remove WORT SPR from POWER9/10 https://git.kernel.org/powerpc/c/0c8fb653d487d2873f5eefdcaf4cecff4e103828 cheers
Re: [PATCH] powerpc/pseries: Fix build error when NUMA=n
On Mon, 16 Aug 2021 14:10:32 +1000, Michael Ellerman wrote: > As reported by lkp, if NUMA=n we see a build error: > >arch/powerpc/platforms/pseries/hotplug-cpu.c: In function > 'pseries_cpu_hotplug_init': >arch/powerpc/platforms/pseries/hotplug-cpu.c:1022:8: error: > 'node_to_cpumask_map' undeclared > 1022 |node_to_cpumask_map[node]); > > Use cpumask_of_node() which has an empty stub for NUMA=n, and when > NUMA=y does a lookup from node_to_cpumask_map[]. > > [...] Applied to powerpc/next. [1/1] powerpc/pseries: Fix build error when NUMA=n https://git.kernel.org/powerpc/c/8b893ef190b0c440877de04f767efca4bf4d6af8 cheers
Re: [PATCH v2] ppc: add "-z notext" flag to disable diagnostic
On Fri, 13 Aug 2021 13:05:11 -0700, Bill Wendling wrote: > Object files used to link .tmp_vmlinux.kallsyms1 have many R_PPC64_ADDR64 > relocations in non-SHF_WRITE sections. There are many text relocations (e.g. > in > .rela___ksymtab_gpl+* and .rela__mcount_loc sections) in a -pie link and are > disallowed by LLD: > > ld.lld: error: can't create dynamic relocation R_PPC64_ADDR64 against local > symbol in readonly segment; recompile object files with -fPIC or pass > '-Wl,-z,notext' to allow text relocations in the output > >>> defined in arch/powerpc/kernel/head_64.o > >>> referenced by arch/powerpc/kernel/head_64.o:(__restart_table+0x10) > > [...] Applied to powerpc/next. [1/1] ppc: add "-z notext" flag to disable diagnostic https://git.kernel.org/powerpc/c/0355785313e2191be4e1108cdbda94ddb0238c48 cheers
Re: [PATCH v2 0/2] Kconfig symbol fixes on powerpc
On Thu, 19 Aug 2021 13:39:52 +0200, Lukas Bulwahn wrote: > The script ./scripts/checkkconfigsymbols.py warns on invalid references to > Kconfig symbols (often, minor typos, name confusions or outdated references). > > This patch series addresses all issues reported by > ./scripts/checkkconfigsymbols.py in ./drivers/usb/ for Kconfig and Makefile > files. Issues in the Kconfig and Makefile files indicate some shortcomings in > the overall build definitions, and often are true actionable issues to > address. > > [...] Patch 1 applied to powerpc/next. [1/2] powerpc: kvm: remove obsolete and unneeded select https://git.kernel.org/powerpc/c/c26d4c5d4f0df7207da3975458261927f9305465 cheers
Re: [PATCH v4 1/3] powerpc/perf: Use stack siar instead of mfspr
On Wed, 18 Aug 2021 22:45:54 +0530, Kajol Jain wrote: > Minor optimization in the 'perf_instruction_pointer' function code by > making use of stack siar instead of mfspr. > > > > Applied to powerpc/next. [1/3] powerpc/perf: Use stack siar instead of mfspr https://git.kernel.org/powerpc/c/b1643084d164cea0c107a39bcdf0119fc52619af [2/3] powerpc/perf: Drop the case of returning 0 as instruction pointer https://git.kernel.org/powerpc/c/cc90c6742ef5b438f4cb86029d7a794bd0a44a06 [3/3] powerpc/perf: Fix the check for SIAR value https://git.kernel.org/powerpc/c/3c69a5f3fa3e312689ec218b5059784d49d7 cheers
Re: [PATCH] powerpc/perf/hv-gpci: Fix the logic to compute counter value from the hcall result buffer.
On Fri, 13 Aug 2021 13:51:58 +0530, Kajol Jain wrote: > H_GetPerformanceCounterInfo (0xF080) hcall returns the counter data in the > result buffer. Result buffer has specific format defined in the PAPR > specification. One of the field is counter offset and width of the counter > data returned. > > Counter data are returned in a unsigned char array. To > get the final counter data, these values should be left shifted > byte at a time. But commit 220a0c609ad17 ("powerpc/perf: Add support > for the hv gpci (get performance counter info) interface") made the > shifting bitwise. Because of this, hcall counters values could end up > in lower side, which messes the counter prev vs now calculation. This > lead to huge counter value reporting > > [...] Applied to powerpc/next. [1/1] powerpc/perf/hv-gpci: Fix the logic to compute counter value from the hcall result buffer. https://git.kernel.org/powerpc/c/f9addd85fbfacf0d155e83dbee8696d6df5ed0c7 cheers
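The byte-at-a-time shifting that the commit message describes can be illustrated in user space. This is a hypothetical helper, not the kernel's hv-gpci code; it only shows why the shift must advance a whole byte (8 bits) per input byte — shifting by one bit per byte, the bug being fixed, collapses the counter toward the low end:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assemble a counter from 'len' big-endian bytes starting at 'offset'
 * in the hcall result buffer, shifting in one whole byte per step.
 */
static uint64_t gpci_counter(const uint8_t *buf, unsigned int offset,
			     unsigned int len)
{
	uint64_t val = 0;
	unsigned int i;

	for (i = offset; i < offset + len; i++)
		val = (val << 8) | buf[i];	/* << 8, not << 1 */

	return val;
}
```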
Re: [PATCH v2 0/3] powerpc: mpc855_ads defconfig fixes
On Tue, 17 Aug 2021 14:24:04 +0930, Joel Stanley wrote: > v2: fix typos, split out mtd fix from savedefconfig patch > > The first patch fixes a build warning I noticed when testing something > unrelated. > > The second fixes a regression where the MTD partition support dropped out > of the config. I have enabled the dependency so this is still part of > the config. > > [...] Applied to powerpc/next. [1/3] powerpc/config: Fix IPV6 warning in mpc855_ads https://git.kernel.org/powerpc/c/c5ac55b6cbc604ad4144c380feae20f69453df91 [2/3] powerpc/config: Renable MTD_PHYSMAP_OF https://git.kernel.org/powerpc/c/d0e28a6145c3455b69991245e7f6147eb914b34a [3/3] powerpc/configs: Regenerate mpc885_ads_defconfig https://git.kernel.org/powerpc/c/87e0d46bf68913f4c87dba94aadc00da658a874b cheers
Re: [PATCH v2 1/2] selftests/powerpc: Add missing clobbered register to to ptrace TM tests
On Thu, 29 Jul 2021 14:13:16 +1000, Jordan Niethe wrote: > ISA v3.1 removes TM but includes a synthetic implementation for > backwards compatibility. With this implementation, the tests > ptrace-tm-spd-gpr and ptrace-tm-gpr should never be able to make any > forward progress and eventually should be killed by the timeout. > Instead on a P10 running in P9 mode, ptrace_tm_gpr fails like so: > > test: ptrace_tm_gpr > tags: git_version:unknown > Starting the child > ... > ... > GPR[27]: 1 Expected: 2 > GPR[28]: 1 Expected: 2 > GPR[29]: 1 Expected: 2 > GPR[30]: 1 Expected: 2 > GPR[31]: 1 Expected: 2 > [FAIL] Test FAILED on line 98 > failure: ptrace_tm_gpr > selftests: ptrace-tm-gpr [FAIL] > > [...] Applied to powerpc/next. [1/2] selftests/powerpc: Add missing clobbered register to to ptrace TM tests https://git.kernel.org/powerpc/c/c95278a0534449efc64ac8169382bce217963be2 [2/2] selftests: Skip TM tests on synthetic TM implementations https://git.kernel.org/powerpc/c/e42edf9b9d126bb1c743f2e7984877ba27f09fe7 cheers
Re: [PATCH] powerpc/tau: Add 'static' storage qualifier to 'tau_work' definition
On Thu, 19 Aug 2021 10:46:54 +1000, Finn Thain wrote: > This patch prevents the following sparse warning. > > arch/powerpc/kernel/tau_6xx.c:199:1: sparse: sparse: symbol 'tau_work' > was not declared. Should it be static? > > > > [...] Applied to powerpc/next. [1/1] powerpc/tau: Add 'static' storage qualifier to 'tau_work' definition https://git.kernel.org/powerpc/c/6cd717fe9b3a787f8e8f9d0bc9b7634a9c104b3e cheers
Re: [PATCH v2 0/3] KVM: PPC: Book3S HV: kvmhv_copy_tofrom_guest_radix changes
On Thu, 5 Aug 2021 18:26:13 -0300, Fabiano Rosas wrote: > This series contains the fix for __kvmhv_copy_tofrom_guest_radix plus > a couple of changes. > > - Patch 1: The fix itself. I thought a smaller fix upfront would be >better to facilitate any backports. > > - Patch 2: Adds a sanity check to the low level function. Since the >hcall API already enforces that the effective address must >be on quadrant 0, moving the checks from the high level >function would only add overhead to the hcall path, so I >opted for a lightweight check at the low level. > > [...] Applied to powerpc/next. [1/3] KVM: PPC: Book3S HV: Fix copy_tofrom_guest routines https://git.kernel.org/powerpc/c/5d7d6dac8fe99ed59eee2300e4a03370f94d5222 [2/3] KVM: PPC: Book3S HV: Add sanity check to copy_tofrom_guest https://git.kernel.org/powerpc/c/c232461c0c3b0aca637f0d7658a7f5d0bb393a4e [3/3] KVM: PPC: Book3S HV: Stop exporting symbols from book3s_64_mmu_radix https://git.kernel.org/powerpc/c/0eb596f1e6103ebe122792a425b88c5dc21c4087 cheers
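The quadrant-0 constraint that patch 2 checks can be sketched stand-alone. On the radix MMU the top two bits of the 64-bit effective address select the quadrant; the helper names below are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

/* On radix, bits 0:1 of the 64-bit effective address select the quadrant. */
static int ea_quadrant(uint64_t ea)
{
	return (int)(ea >> 62);
}

/*
 * The hcall API discussed above only permits quadrant 0, so a cheap
 * low-level sanity check can reject anything else early.
 */
static int ea_ok_for_copy(uint64_t ea)
{
	return ea_quadrant(ea) == 0;
}
```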
Re: [PATCH v2 0/2] W=1 fixes
On Mon, 23 Aug 2021 11:00:37 +0200, Cédric Le Goater wrote: > These are the remaining patches needed to compile the ppc kernel with > W=1. Audit issues are now being addressed by Christophe in patch : > > [v2] powerpc/audit: Convert powerpc to AUDIT_ARCH_COMPAT_GENERIC > http://patchwork.ozlabs.org/project/linuxppc-dev/patch/dc14509a28a993738b1325211f412be72a4f9b1e.1629701132.git.christophe.le...@csgroup.eu/ > > Thanks, > > [...] Applied to powerpc/next. [1/2] powerpc/prom: Fix unused variable ‘reserve_map’ when CONFIG_PPC32 is not set https://git.kernel.org/powerpc/c/3accc0faef081b6813967b34f7d05a3edb855cbd [2/2] powerpc/compat_sys: Declare syscalls https://git.kernel.org/powerpc/c/cc47ad409ba9cc950e9c492c8ba653dabd392148 cheers
Re: [PATCH 0/6] W=1 fixes
On Thu, 19 Aug 2021 14:56:50 +0200, Cédric Le Goater wrote: > With this small series, I could compile the ppc kernel with W=1. There > are certainly other configs which will need more fixes but it's a good > start. > > The last 2 patches look hacky. Christophe, could you help with these > to find a better place to include the declarations ? > > [...] Patches 2-4 applied to powerpc/next. [2/6] powerpc/pseries/vas: Declare pseries_vas_fault_thread_fn() as static https://git.kernel.org/powerpc/c/4cb266074aa17e9cafed3a92e9f43b161516569f [3/6] KVM: PPC: Book3S PR: Declare kvmppc_handle_exit_pr() https://git.kernel.org/powerpc/c/cb53a93e33e108bfec00a13cd12696e1c0ccc8b6 [4/6] KVM: PPC: Book3S PR: Remove unused variable https://git.kernel.org/powerpc/c/b352ddae7b2ccd2cb29f495ca0a7c9b6848b623f cheers
Re: [PATCH v4 1/4] powerpc/ptdump: Use DEFINE_SHOW_ATTRIBUTE()
On Thu, 8 Jul 2021 16:49:40 + (UTC), Christophe Leroy wrote: > Use DEFINE_SHOW_ATTRIBUTE() instead of open coding > open() and fops. > > > > Applied to powerpc/next. [1/4] powerpc/ptdump: Use DEFINE_SHOW_ATTRIBUTE() https://git.kernel.org/powerpc/c/11f27a7fa4ca27935de74e3eb052bdc430d5f8d8 [2/4] powerpc/ptdump: Remove unused 'page_size' parameter https://git.kernel.org/powerpc/c/64b87b0c70e0fd28352895cba3c0a9631e0072dd [3/4] powerpc/ptdump: Reduce level numbers by 1 in note_page() and add p4d level https://git.kernel.org/powerpc/c/cf98d2b6eea6a1b2c43f85680ad58fcc3ea9496b [4/4] powerpc/ptdump: Convert powerpc to GENERIC_PTDUMP https://git.kernel.org/powerpc/c/e084728393a58e7fdafeee2e6b20e0faff09b557 cheers
Re: [PATCH v3] powerpc/booke: Avoid link stack corruption in several places
On Tue, 24 Aug 2021 07:56:26 + (UTC), Christophe Leroy wrote: > Use bcl 20,31,+4 instead of bl in order to preserve link stack. > > See commit c974809a26a1 ("powerpc/vdso: Avoid link stack corruption > in __get_datapage()") for details. > > > > [...] Applied to powerpc/next. [1/1] powerpc/booke: Avoid link stack corruption in several places https://git.kernel.org/powerpc/c/f5007dbf4da729baa850b33a64dc3cc53757bdf8 cheers
Re: (subset) [PATCH v3 1/3] powerpc: Remove MSR_PR check in interrupt_exit_{user/kernel}_prepare()
On Mon, 23 Aug 2021 08:24:20 + (UTC), Christophe Leroy wrote: > In those hot functions that are called at every interrupt, any saved > cycle is worth it. > > interrupt_exit_user_prepare() and interrupt_exit_kernel_prepare() are > called from three places: > - From entry_32.S > - From interrupt_64.S > - From interrupt_exit_user_restart() and interrupt_exit_kernel_restart() > > [...] Applied to powerpc/next. [2/3] powerpc: Refactor verification of MSR_RI https://git.kernel.org/powerpc/c/806c0e6e7e97adc17389c8dc1f52d4736f49299b cheers
Re: [PATCH v2] powerpc: Avoid link stack corruption in misc asm functions
On Tue, 24 Aug 2021 07:56:35 + (UTC), Christophe Leroy wrote: > bl;mflr is used at several places to get code position. > > Use bcl 20,31,+4 instead of bl in order to preserve link stack. > > See commit c974809a26a1 ("powerpc/vdso: Avoid link stack corruption > in __get_datapage()") for details. > > [...] Applied to powerpc/next. [1/1] powerpc: Avoid link stack corruption in misc asm functions https://git.kernel.org/powerpc/c/33e1402435cb9f3021439a15935ea2dc69ec1844 cheers
Re: [PATCH v1 1/2] powerpc: Use lwarx/ldarx directly instead of PPC_LWARX/LDARX macros
On Tue, 2 Mar 2021 08:48:11 + (UTC), Christophe Leroy wrote: > Force the eh flag at 0 on PPC32. > Patch 1 applied to powerpc/next. [1/2] powerpc: Use lwarx/ldarx directly instead of PPC_LWARX/LDARX macros https://git.kernel.org/powerpc/c/9401f4e46cf6965e23738f70e149172344a01eef cheers
Re: [PATCH] powerpc/syscalls: Simplify do_mmap2()
On Fri, 25 Jun 2021 10:58:33 + (UTC), Christophe Leroy wrote: > When shift is nul, operations remain valid so no test needed. > > And 'ret' is unnecessary. > > And use IS_ALIGNED() to check alignment, that's more clear. > > > [...] Applied to powerpc/next. [1/1] powerpc/syscalls: Simplify do_mmap2() https://git.kernel.org/powerpc/c/316389e904f968d24d44cd96a6d171ee1ef269cf cheers
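The two points in the commit message — IS_ALIGNED() reads better, and a shift of zero needs no special case — can be shown with a small sketch. The macro here is a simplified version of the kernel's (which uses typeof()); the helper is hypothetical, not do_mmap2() itself:

```c
#include <assert.h>

/* Simplified IS_ALIGNED(): 'a' must be a power of two. */
#define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)

/*
 * do_mmap2()-style check: the offset passed from userspace in 4 KB units
 * must convert to whole pages, i.e. be aligned to 1 << shift. With
 * shift == 0 the mask is 0 and the test is trivially true, so the
 * shift-is-zero case needs no separate branch.
 */
static int mmap2_offset_ok(unsigned long off_4k, unsigned int shift)
{
	return IS_ALIGNED(off_4k, 1ul << shift);
}
```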
Re: [PATCH] powerpc/syscalls: Remove __NR__exit
On Mon, 23 Aug 2021 06:45:20 + (UTC), Christophe Leroy wrote: > __NR_exit is nowhere used. On most architectures it was removed by > commit 135ab6ec8fda ("[PATCH] remove remaining errno and > __KERNEL_SYSCALLS__ references") but not on powerpc. > > powerpc removed __KERNEL_SYSCALLS__ in commit 3db03b4afb3e ("[PATCH] > rename the provided execve functions to kernel_execve"), but __NR_exit > was left over. > > [...] Applied to powerpc/next. [1/1] powerpc/syscalls: Remove __NR__exit https://git.kernel.org/powerpc/c/a00ea5b6f2bbef8b004b0b7228c61680a50c7c3f cheers
Re: [PATCH] powerpc/ptrace: Make user_mode() common to PPC32 and PPC64
On Tue, 17 Aug 2021 16:00:14 + (UTC), Christophe Leroy wrote: > Today we have: > > #ifdef __powerpc64__ > #define user_mode(regs) regs)->msr) >> MSR_PR_LG) & 0x1) > #else > #define user_mode(regs) (((regs)->msr & MSR_PR) != 0) > #endif > > [...] Applied to powerpc/next. [1/1] powerpc/ptrace: Make user_mode() common to PPC32 and PPC64 https://git.kernel.org/powerpc/c/19e932eb6ea47f4f37513eb2ae0daee19117765c cheers
Re: [PATCH] powerpc/32: indirect function call use bctrl rather than blrl in ret_from_kernel_thread
On Fri, 20 Aug 2021 05:16:05 + (UTC), Christophe Leroy wrote: > Copied from commit 89bbe4c798bc ("powerpc/64: indirect function call > use bctrl rather than blrl in ret_from_kernel_thread") > > blrl is not recommended to use as an indirect function call, as it may > corrupt the link stack predictor. > > This is not a performance critical path but this should be fixed for > consistency. > > [...] Applied to powerpc/next. [1/1] powerpc/32: indirect function call use bctrl rather than blrl in ret_from_kernel_thread https://git.kernel.org/powerpc/c/113ec9ccc8049c3772f0eab46b62c5d6654c09f7 cheers
Re: [PATCH] powerpc/audit: Simplify syscall_get_arch()
On Fri, 20 Aug 2021 09:39:14 + (UTC), Christophe Leroy wrote: > Make use of is_32bit_task() and CONFIG_CPU_LITTLE_ENDIAN > to simplify syscall_get_arch(). > > > > Applied to powerpc/next. [1/1] powerpc/audit: Simplify syscall_get_arch() https://git.kernel.org/powerpc/c/770cec16cdc9d1e67896dd2214a4fec9a3fa cheers
Re: [PATCH] powerpc/audit: Avoid unneccessary #ifdef in syscall_get_arguments()
On Fri, 20 Aug 2021 09:28:19 + (UTC), Christophe Leroy wrote: > Use is_32bit_task() which already handles CONFIG_COMPAT. > > > > Applied to powerpc/next. [1/1] powerpc/audit: Avoid unneccessary #ifdef in syscall_get_arguments() https://git.kernel.org/powerpc/c/898a1ef06ad4a2a8e3c9490ce62624fcd1a7b1f8 cheers
Re: [PATCH] powerpc/32: Remove unneccessary calculations in load_up_{fpu/altivec}
On Wed, 18 Aug 2021 08:47:28 + (UTC), Christophe Leroy wrote: > No need to re-read SPRN_THREAD, we can calculate thread address > from current (r2). > > And remove a reload of value 1 into r4 as r4 is already 1. > > > > [...] Applied to powerpc/next. [1/1] powerpc/32: Remove unneccessary calculations in load_up_{fpu/altivec} https://git.kernel.org/powerpc/c/51ed00e71f0130e0f3534b8e7d78facd16829426 cheers
Re: [PATCH] powerpc/doc: Fix htmldocs errors
On Wed, 25 Aug 2021 09:54:47 +0530, Aneesh Kumar K.V wrote: > Fix make htmldocs related errors with the newly added associativity.rst > doc file. > > > > Applied to powerpc/next. [1/1] powerpc/doc: Fix htmldocs errors https://git.kernel.org/powerpc/c/f50da6edbf1ebf35dd8070847bfab5cb988d472b cheers
Re: [PATCH v4 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance
On Tue, 10 Aug 2021 at 16:41, Ricardo Neri wrote: > > When deciding to pull tasks in ASYM_PACKING, it is necessary not only to > check for the idle state of the destination CPU, dst_cpu, but also of > its SMT siblings. > > If dst_cpu is idle but its SMT siblings are busy, performance suffers > if it pulls tasks from a medium priority CPU that does not have SMT > siblings. > > Implement asym_smt_can_pull_tasks() to inspect the state of the SMT > siblings of both dst_cpu and the CPUs in the candidate busiest group. > > Cc: Aubrey Li > Cc: Ben Segall > Cc: Daniel Bristot de Oliveira > Cc: Dietmar Eggemann > Cc: Mel Gorman > Cc: Quentin Perret > Cc: Rafael J. Wysocki > Cc: Srinivas Pandruvada > Cc: Steven Rostedt > Cc: Tim Chen > Reviewed-by: Joel Fernandes (Google) > Reviewed-by: Len Brown > Signed-off-by: Ricardo Neri > --- > Changes since v3: > * Removed the arch_asym_check_smt_siblings() hook. Discussions with the > powerpc folks showed that this patch should not impact them. Also, more > recent powerpc processor no longer use asym_packing. (PeterZ) > * Removed unnecessary local variable in asym_can_pull_tasks(). (Dietmar) > * Removed unnecessary check for local CPUs when the local group has zero > utilization. (Joel) > * Renamed asym_can_pull_tasks() as asym_smt_can_pull_tasks() to reflect > the fact that it deals with SMT cases. > * Made asym_smt_can_pull_tasks() return false for !CONFIG_SCHED_SMT so > that callers can deal with non-SMT cases. > > Changes since v2: > * Reworded the commit message to reflect updates in code. > * Corrected misrepresentation of dst_cpu as the CPU doing the load > balancing. (PeterZ) > * Removed call to arch_asym_check_smt_siblings() as it is now called in > sched_asym(). > > Changes since v1: > * Don't bailout in update_sd_pick_busiest() if dst_cpu cannot pull > tasks. Instead, reclassify the candidate busiest group, as it > may still be selected. 
>    (PeterZ)
>  * Avoid an expensive and unnecessary call to cpumask_weight() when
>    determining if a sched_group is comprised of SMT siblings.
>    (PeterZ).
> ---
>  kernel/sched/fair.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 95 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dd411cefb63f..8a1a2a43732c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8531,10 +8531,99 @@ group_type group_classify(unsigned int imbalance_pct,
>  	return group_has_spare;
>  }
>
> +/**
> + * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull tasks
> + * @dst_cpu:	Destination CPU of the load balancing
> + * @sds:	Load-balancing data with statistics of the local group
> + * @sgs:	Load-balancing statistics of the candidate busiest group
> + * @sg:		The candidate busiest group
> + *
> + * Check the state of the SMT siblings of both @sds::local and @sg and decide
> + * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, it can
> + * pull tasks if two or more of the SMT siblings of @sg are busy. If only one
> + * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority.
> + *
> + * If both @dst_cpu and @sg have SMT siblings, even the number of idle CPUs
> + * between @sds::local and @sg. Thus, pull tasks from @sg if the difference
> + * between the number of busy CPUs is 2 or more. If the difference is of 1,
> + * only pull if @dst_cpu has higher priority. If @sg does not have SMT siblings
> + * only pull tasks if all of the SMT siblings of @dst_cpu are idle and @sg
> + * has lower priority.
> + */
> +static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> +				    struct sg_lb_stats *sgs,
> +				    struct sched_group *sg)
> +{
> +#ifdef CONFIG_SCHED_SMT
> +	bool local_is_smt, sg_is_smt;
> +	int sg_busy_cpus;
> +
> +	local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
> +	sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
> +
> +	sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
> +
> +	if (!local_is_smt) {
> +		/*
> +		 * If we are here, @dst_cpu is idle and does not have SMT
> +		 * siblings. Pull tasks if candidate group has two or more
> +		 * busy CPUs.
> +		 */
> +		if (sg_is_smt && sg_busy_cpus >= 2)
> +			return true;
> +
> +		/*
> +		 * @dst_cpu does not have SMT siblings. @sg may have SMT
> +		 * siblings and only one is busy. In such case, @dst_cpu
> +		 * can help if it has higher priority and is idle.
> +		 */
> +		return !sds->local_stat.group_util &&

sds->local_stat.group_util can't be used to decide if a CPU or group of CPUs is idle. util_avg is usually non-zero when a CPU becomes idle, and you may have to wait more than
[RFC PATCH] powerpc: Investigate static_call concept
This RFC is to validate the concept of static_call on powerpc. Highly copied from x86.

It replaces ppc_md.get_irq(), which is called at every IRQ, with a static call.

When updating the call, we just replace the instruction at the trampoline address with a relative jump to the function. For the time being, the case of out-of-range functions is not handled.

Note that even if we don't immediately use static calls in powerpc code, supporting static calls would immediately benefit core functionalities that use them, like tracing.

Tested on powerpc 8xx.

With the patch:

<__SCT__ppc_md_get_irq>:
   0:	4e 80 00 20 	blr		<== Replaced by 'b <func>' at runtime
	...

0038 <__do_irq>:
  38:	94 21 ff f0 	stwu	r1,-16(r1)
  3c:	7c 08 02 a6 	mflr	r0
  40:	90 01 00 14 	stw	r0,20(r1)
  44:	48 00 00 01 	bl	44 <__do_irq+0xc>
			44: R_PPC_REL24	__SCT__ppc_md_get_irq
	...

Before the patch:

0038 <__do_irq>:
  38:	3d 20 00 00 	lis	r9,0
			3a: R_PPC_ADDR16_HA	ppc_md+0x20
  3c:	94 21 ff f0 	stwu	r1,-16(r1)
  40:	81 29 00 00 	lwz	r9,0(r9)
			42: R_PPC_ADDR16_LO	ppc_md+0x20
  44:	7c 08 02 a6 	mflr	r0
  48:	90 01 00 14 	stw	r0,20(r1)
  4c:	7d 29 03 a6 	mtctr	r9
  50:	4e 80 04 21 	bctrl
	...
Signed-off-by: Christophe Leroy
---
 arch/powerpc/Kconfig                   |  1 +
 arch/powerpc/include/asm/machdep.h     |  3 +++
 arch/powerpc/include/asm/static_call.h | 25 +++++++++++++++++++++++++
 arch/powerpc/kernel/Makefile           |  2 +-
 arch/powerpc/kernel/irq.c              |  2 +-
 arch/powerpc/kernel/setup-common.c     |  4 ++++
 arch/powerpc/kernel/static_call.c      | 16 ++++++++++++++++
 7 files changed, 51 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/include/asm/static_call.h
 create mode 100644 arch/powerpc/kernel/static_call.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 36b72d972568..c3930ea63e59 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -247,6 +247,7 @@ config PPC
 	select HAVE_SOFTIRQ_ON_OWN_STACK
 	select HAVE_STACKPROTECTOR	if PPC32 && $(cc-option,-mstack-protector-guard=tls -mstack-protector-guard-reg=r2)
 	select HAVE_STACKPROTECTOR	if PPC64 && $(cc-option,-mstack-protector-guard=tls -mstack-protector-guard-reg=r13)
+	select HAVE_STATIC_CALL
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_VIRT_CPU_ACCOUNTING
 	select HUGETLB_PAGE_SIZE_VARIABLE	if PPC_BOOK3S_64 && HUGETLB_PAGE
diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 764f2732a821..ac9712312b76 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -7,6 +7,7 @@
 #include
 #include
 #include
+#include
 #include
@@ -295,5 +296,7 @@ static inline void log_error(char *buf, unsigned int err_type, int fatal)
 #define machine_late_initcall(mach, fn)		__define_machine_initcall(mach, fn, 7)
 #define machine_late_initcall_sync(mach, fn)	__define_machine_initcall(mach, fn, 7s)
 
+DECLARE_STATIC_CALL(ppc_md_get_irq, *ppc_md.get_irq);
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_MACHDEP_H */
diff --git a/arch/powerpc/include/asm/static_call.h b/arch/powerpc/include/asm/static_call.h
new file mode 100644
index 000000000000..335ee4ceaef9
--- /dev/null
+++ b/arch/powerpc/include/asm/static_call.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_STATIC_CALL_H
+#define _ASM_POWERPC_STATIC_CALL_H
+
+#define ARCH_DEFINE_STATIC_CALL_TRAMP(name, func)		\
+	asm(".pushsection .text, \"ax\" \n"			\
+	    ".align 4 \n"					\
+	    ".globl " STATIC_CALL_TRAMP_STR(name) " \n"		\
+	    STATIC_CALL_TRAMP_STR(name) ": \n"			\
+	    "	b " #func " \n"					\
+	    ".type " STATIC_CALL_TRAMP_STR(name) ", @function \n" \
+	    ".size " STATIC_CALL_TRAMP_STR(name) ", . - " STATIC_CALL_TRAMP_STR(name) " \n" \
+	    ".popsection\n")
+
+#define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)		\
+	asm(".pushsection .text, \"ax\" \n"			\
+	    ".align 4 \n"					\
+	    ".globl " STATIC_CALL_TRAMP_STR(name) " \n"		\
+	    STATIC_CALL_TRAMP_STR(name) ": \n"			\
+	    "	blr \n"						\
+	    ".type " STATIC_CALL_TRAMP_STR(name) ", @function \n"
Re: [PATCH v2 2/2] powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto
Nathan Chancellor writes:
> On Thu, Aug 26, 2021 at 11:54:17AM -0700, Nathan Chancellor wrote:
>> On Thu, Aug 26, 2021 at 01:21:39PM +1000, Michael Ellerman wrote:
>> > Nathan Chancellor writes:
>> > > On Tue, Apr 13, 2021 at 04:38:10PM +0000, Christophe Leroy wrote:
>> > >> Using asm goto in __WARN_FLAGS() and WARN_ON() allows more
>> > >> flexibility to GCC.
>> > ...
>> > >
>> > > This patch as commit 1e688dd2a3d6 ("powerpc/bug: Provide better
>> > > flexibility to WARN_ON/__WARN_FLAGS() with asm goto") cause a WARN_ON in
>> > > klist_add_tail to trigger over and over on boot when compiling with
>> > > clang:
>> > > ...
>> >
>> > This patch seems to fix it. Not sure if that's just papering over it
>> > though.
>> >
>> > diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
>> > index 1ee0f22313ee..75fcb4370d96 100644
>> > --- a/arch/powerpc/include/asm/bug.h
>> > +++ b/arch/powerpc/include/asm/bug.h
>> > @@ -119,7 +119,7 @@ __label_warn_on:				\
>> > 								\
>> > 	WARN_ENTRY(PPC_TLNEI " %4, 0",				\
>> > 		   BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN),	\
>> > -		   __label_warn_on, "r" (x));			\
>> > +		   __label_warn_on, "r" (!!(x)));		\
>> > 	break;							\
>> > __label_warn_on:						\
>> > 	__ret_warn_on = true;					\
>> >
>>
>> Thank you for the in-depth explanation and triage! I have gone ahead and
>> created a standalone reproducer that shows this based on the
>> preprocessed file and opened https://bugs.llvm.org/show_bug.cgi?id=51634
>> so the LLVM developers can take a look.
>
> Based on Eli Friedman's comment in the bug, would something like this
> work and not regress GCC? I noticed that the BUG_ON macro does a cast as
> well. Nick pointed out to me privately that we have run into what seems
> like a similar issue before in commit 6c58f25e6938 ("riscv/atomic: Fix
> sign extension for RV64I").

Yes, I read that comment this morning, and then landed at the same fix via digging through the history of our bug code.
We in fact fixed a similar bug 16 years ago :}

32818c2eb6b8 ("[PATCH] ppc64: Fix issue with gcc 4.0 compiled kernels")

Which is when we first started adding the cast to long.

> diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
> index 1ee0f22313ee..35022667f57d 100644
> --- a/arch/powerpc/include/asm/bug.h
> +++ b/arch/powerpc/include/asm/bug.h
> @@ -119,7 +119,7 @@ __label_warn_on:				\
> 								\
> 	WARN_ENTRY(PPC_TLNEI " %4, 0",				\
> 		   BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN),	\
> -		   __label_warn_on, "r" (x));			\
> +		   __label_warn_on, "r" ((__force long)(x)));	\
> 	break;							\
> __label_warn_on:						\
> 	__ret_warn_on = true;					\

Yeah that fixes the clang build for me. For GCC it seems to generate the same code in the simple cases:

void test_warn_on_ulong(unsigned long b)
{
	WARN_ON(b);
}

void test_warn_on_int(int b)
{
	WARN_ON(b);
}

I get:

0020 <.test_warn_on_ulong>:
  20:	0b 03 00 00 	tdnei	r3,0
  24:	4e 80 00 20 	blr
  28:	60 00 00 00 	nop
  2c:	60 00 00 00 	nop

0030 <.test_warn_on_int>:
  30:	0b 03 00 00 	tdnei	r3,0
  34:	4e 80 00 20 	blr

Both before and after the change. So that's good.

For:

void test_warn_on_int_addition(int b)
{
	WARN_ON(b+1);
}

Before I get:

0040 <.test_warn_on_int_addition>:
  40:	38 63 00 01 	addi	r3,r3,1
  44:	0b 03 00 00 	tdnei	r3,0
  48:	4e 80 00 20 	blr

vs after:

0040 <.test_warn_on_int_addition>:
  40:	38 63 00 01 	addi	r3,r3,1
  44:	7c 63 07 b4 	extsw	r3,r3
  48:	0b 03 00 00 	tdnei	r3,0
  4c:	4e 80 00 20 	blr

So there's an extra sign extension we don't need, but I guess we can probably live with that.

cheers
Re: [PATCH v2 1/4] powerpc/64: handle MSR EE and RI in interrupt entry wrapper
Hi,

> Similarly to the system call change in the previous patch, the mtmsrd to

I don't actually know what patch this was - I assume it's from a series that has since been applied?

> enable RI can be combined with the mtmsrd to enable EE for interrupts
> which enable the latter, which tends to be the important synchronous
> interrupts (i.e., page faults).
>
> Do this by enabling EE and RI together at the beginning of the entry
> wrapper if PACA_IRQ_HARD_DIS is clear, and just enabling RI if it is set
> (which means something wanted EE=0).

> diff --git a/arch/powerpc/include/asm/interrupt.h b/arch/powerpc/include/asm/interrupt.h
> index 6b800d3e2681..e3228a911b35 100644
> --- a/arch/powerpc/include/asm/interrupt.h
> +++ b/arch/powerpc/include/asm/interrupt.h
> @@ -148,9 +148,21 @@ static inline void interrupt_enter_prepare(struct pt_regs *regs, struct interrup
>  #endif
>
>  #ifdef CONFIG_PPC64
> -	if (irq_soft_mask_set_return(IRQS_ALL_DISABLED) == IRQS_ENABLED)
> +	bool trace_enable = false;
> +
> +	if (IS_ENABLED(CONFIG_TRACE_IRQFLAGS)) {
> +		if (irq_soft_mask_set_return(IRQS_DISABLED) == IRQS_ENABLED)

In the previous code we had IRQS_ALL_DISABLED, now we just have IRQS_DISABLED. Is that intentional?

> +			trace_enable = true;
> +	} else {
> +		irq_soft_mask_set(IRQS_DISABLED);
> +	}
> +	/* If the interrupt was taken with HARD_DIS set, don't enable MSR[EE] */
> +	if (local_paca->irq_happened & PACA_IRQ_HARD_DIS)
> +		__hard_RI_enable();
> +	else
> +		__hard_irq_enable();
> +	if (trace_enable)
> 		trace_hardirqs_off();
> -	local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
>
>  	if (user_mode(regs)) {
>  		CT_WARN_ON(ct_state() != CONTEXT_USER);

> @@ -901,11 +892,6 @@ INT_DEFINE_BEGIN(system_reset)
> 	IVEC=0x100
> 	IAREA=PACA_EXNMI
> 	IVIRT=0 /* no virt entry point */
> -	/*
> -	 * MSR_RI is not enabled, because PACA_EXNMI and nmi stack is
> -	 * being used, so a nested NMI exception would corrupt it.
> -	 */
> -	ISET_RI=0
> 	ISTACK=0
> 	IKVM_REAL=1
> INT_DEFINE_END(system_reset)

> @@ -986,8 +972,6 @@ EXC_COMMON_BEGIN(system_reset_common)

Right before this change, there's a comment that reads:

/*
 * Increment paca->in_nmi then enable MSR_RI. SLB or MCE will be able
 * to recover, but nested NMI will notice in_nmi and not recover
...

You've taken out the bit that enables MSR_RI, which means the comment is no longer accurate.

Beyond that, I'm still trying to follow all the various changes in code flows. It seems to make at least vague sense: we move the setting of MSR_RI from the early asm to interrupt*enter_prepare. I'm just struggling to convince myself that, when we hit the underlying handler, the RI states all line up.

Kind regards,
Daniel