Re: [PATCH 3/7] KVM: s390: Use Makefile.kvm for common files
Am 16.11.21 um 12:50 schrieb David Woodhouse:
> From: David Woodhouse
>
> Signed-off-by: David Woodhouse

Looks good.

Reviewed-by: Christian Borntraeger

> ---
>  arch/s390/kvm/Makefile | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/arch/s390/kvm/Makefile b/arch/s390/kvm/Makefile
> index b3aaadc60ead..e4f50453cf7f 100644
> --- a/arch/s390/kvm/Makefile
> +++ b/arch/s390/kvm/Makefile
> @@ -3,13 +3,11 @@
>  #
>  # Copyright IBM Corp. 2008
>
> -KVM := ../../../virt/kvm
> -common-objs = $(KVM)/kvm_main.o $(KVM)/eventfd.o $(KVM)/async_pf.o \
> -	$(KVM)/irqchip.o $(KVM)/vfio.o $(KVM)/binary_stats.o
> +include $(srctree)/virt/kvm/Makefile.kvm
>
>  ccflags-y := -Ivirt/kvm -Iarch/s390/kvm
>
> -kvm-objs := $(common-objs) kvm-s390.o intercept.o interrupt.o priv.o sigp.o
> +kvm-objs := kvm-s390.o intercept.o interrupt.o priv.o sigp.o
>  kvm-objs += diag.o gaccess.o guestdbg.o vsie.o pv.o
>
>  obj-$(CONFIG_KVM) += kvm.o
Re: [RESEND PATCH v5 0/4] Add perf interface to expose nvdimm
On 11/16/21 8:29 PM, LEROY Christophe wrote:
> Hi
>
> Le 16/11/2021 à 05:49, Kajol Jain a écrit :
>> Patchset adds performance stats reporting support for nvdimm.
>> Added interface includes support for pmu register/unregister
>> functions. A structure is added called nvdimm_pmu to be used for
>> adding arch/platform specific data such as cpumask, nvdimm device
>> pointer and pmu event functions like event_init/add/read/del.
>> User could use the standard perf tool to access perf events
>> exposed via pmu.
>>
>> Interface also defines supported event list, config fields for the
>> event attributes and their corresponding bit values which are exported
>> via sysfs. Patch 3 exposes IBM pseries platform nmem* device
>> performance stats using this interface.
>
> You resending your v5 series ? Is there any news since your submission
> last month that's awaiting in patchwork here at
> https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=264422 ?

Hi Christophe,
There is no code side change from the last v5 version. Since I am
looking for reviews, I again posted this patchset with the RESEND tag.

Thanks,
Kajol Jain

> Christophe
>
>> Result from power9 pseries lpar with 2 nvdimm devices:
>>
>> Ex: List all events by perf list
>>
>> command:# perf list nmem
>>
>>   nmem0/cache_rh_cnt/            [Kernel PMU event]
>>   nmem0/cache_wh_cnt/            [Kernel PMU event]
>>   nmem0/cri_res_util/            [Kernel PMU event]
>>   nmem0/ctl_res_cnt/             [Kernel PMU event]
>>   nmem0/ctl_res_tm/              [Kernel PMU event]
>>   nmem0/fast_w_cnt/              [Kernel PMU event]
>>   nmem0/host_l_cnt/              [Kernel PMU event]
>>   nmem0/host_l_dur/              [Kernel PMU event]
>>   nmem0/host_s_cnt/              [Kernel PMU event]
>>   nmem0/host_s_dur/              [Kernel PMU event]
>>   nmem0/med_r_cnt/               [Kernel PMU event]
>>   nmem0/med_r_dur/               [Kernel PMU event]
>>   nmem0/med_w_cnt/               [Kernel PMU event]
>>   nmem0/med_w_dur/               [Kernel PMU event]
>>   nmem0/mem_life/                [Kernel PMU event]
>>   nmem0/poweron_secs/            [Kernel PMU event]
>>   ...
>>   nmem1/mem_life/                [Kernel PMU event]
>>   nmem1/poweron_secs/            [Kernel PMU event]
>>
>> Patch1:
>>   Introduces the nvdimm_pmu structure
>> Patch2:
>>   Adds common interface to add arch/platform specific data
>>   including nvdimm device pointer and pmu data along with
>>   pmu event functions. It also defines the supported event list
>>   and adds attribute groups for format, events and cpumask.
>>   It also adds code for cpu hotplug support.
>> Patch3:
>>   Adds code in arch/powerpc/platform/pseries/papr_scm.c to expose
>>   nmem* pmu. It fills in the nvdimm_pmu structure with pmu name,
>>   capabilities, cpumask and event functions and then registers
>>   the pmu by adding callbacks to register_nvdimm_pmu.
>> Patch4:
>>   Sysfs documentation patch
>>
>> Changelog
>> ---
>> v4 -> v5:
>> - Remove multiple variables defined in nvdimm_pmu structure including
>>   name and pmu functions (event_init/add/del/read) as they are just
>>   used to copy them again into the pmu variable. Now we are directly
>>   doing this step in arch specific code as suggested by Dan Williams.
>>
>> - Remove attribute group field from nvdimm pmu structure and
>>   define these attribute groups in the common interface which
>>   includes format and event list along with cpumask, as suggested by
>>   Dan Williams.
>>   Since we added a static definition for the attribute groups needed
>>   in the common interface, remove the corresponding code from papr.
>>
>> - Add nvdimm pmu event list with event codes in the common interface.
>>
>> - Remove Acked-by/Reviewed-by/Tested-by tags as code is refactored
>>   to handle review comments from Dan.
>>
>> - Make nvdimm_pmu_free_hotplug_memory function static as reported
>>   by kernel test robot, also add corresponding Reported-by tag.
>>
>> - Link to the patchset v4: https://lkml.org/lkml/2021/9/3/45
>>
>> v3 -> v4
>> - Rebase code on top of current papr_scm code without any logical
>>   changes.
>>
>> - Added Acked-by tag from Peter Zijlstra and Reviewed-by tag
>>   from Madhavan Srinivasan.
>>
>> - Link to the patchset v3: https://lkml.org/lkml/2021/6/17/605
>>
>> v2 -> v3
>> - Added Tested-by tag.
>>
>> - Fix nvdimm mailing list in the ABI Documentation.
>>
>> - Link to the patchset v2: https://lkml.org/lkml/2021/6/14/25
>>
>> v1 -> v2
>> - Fix hotplug code by adding pmu migration call
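The cover letter above describes arch code filling an nvdimm_pmu structure with a name, event callbacks (event_init/add/del/read) and an nvdimm device pointer, then handing it to register_nvdimm_pmu. The shape of that flow can be sketched as a minimal userspace mock; every name below (fake_pmu, nvdimm_pmu_sketch, register_nvdimm_pmu_sketch, the papr_scm_* callbacks) is a hypothetical stand-in for the proposed kernel API, not the actual interface:

```c
/*
 * Userspace sketch of the registration flow described in the cover
 * letter. All names are illustrative stand-ins for the proposed
 * kernel API, not the real one.
 */
#include <assert.h>
#include <stddef.h>

struct fake_pmu {
	const char *name;
	int (*event_init)(void);   /* stand-ins for event_init/add/del/read */
	int (*add)(void);
	void (*del)(void);
	void (*read)(void);
};

struct nvdimm_pmu_sketch {
	struct fake_pmu pmu;
	void *dev;                 /* nvdimm device pointer */
};

/* Hypothetical per-platform callbacks, as papr_scm would supply. */
static int papr_scm_event_init(void) { return 0; }
static int papr_scm_add(void) { return 0; }
static void papr_scm_del(void) { }
static void papr_scm_read(void) { }

/* Mimics register_nvdimm_pmu(): reject half-filled structures. */
static int register_nvdimm_pmu_sketch(struct nvdimm_pmu_sketch *p)
{
	if (!p->pmu.name || !p->pmu.event_init || !p->pmu.add)
		return -1;
	return 0;
}
```

The point of the v5 rework described in the changelog is visible here: the arch code fills the callbacks directly rather than the common code copying them from duplicate fields.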
Re: [PATCH v2 1/2] powerpc/rtas: rtas_busy_delay() improvements
Nathan Lynch writes:
>
> Signed-off-by: Nathan Lynch
> ---
>
> Notes:
>     This could be considered a sequel to:
>
>     https://lore.kernel.org/linuxppc-dev/20210504030358.1715034-1-nath...@linux.ibm.com/
>
>     which tried to address both the suboptimal delay behavior as well
>     as naming issues. The present change achieves the performance
>     improvement and leaves the matter of naming for another time.
>
>     I've incorporated Alexey's feedback from that series to avoid
>     division when calculating the arguments to usleep_range().

Oops, this part is redundant with information in the cover. Git should
discard it if/when applied, though.
Re: [PATCH] ppc64/fadump: fix inaccurate CPU state info in vmcore generated with panic
On 11/11/21 6:50 pm, Nicholas Piggin wrote:
> Excerpts from Hari Bathini's message of November 11, 2021 10:06 pm:
>> On 11/11/21 11:44 am, Michael Ellerman wrote:
>>> Hari Bathini writes:
>>>> In panic path, fadump is triggered via a panic notifier function.
>>>> Before calling panic notifier functions, smp_send_stop() gets
>>>> called, which stops all CPUs except the panic'ing CPU. Commit
>>>> 8389b37dffdc ("powerpc: stop_this_cpu: remove the cpu from the
>>>> online map.") and again commit bab26238bbd4 ("powerpc: Offline CPU
>>>> in stop_this_cpu()") started marking CPUs as offline while stopping
>>>> them. So, if a kernel has either of the above commits, a vmcore
>>>> captured with fadump via the panic path would show only the
>>>> panic'ing CPU as online. Sample output of crash-utility with such
>>>> a vmcore:
>>>>
>>>> # crash vmlinux vmcore
>>>> ...
>>>>       KERNEL: vmlinux
>>>>     DUMPFILE: vmcore  [PARTIAL DUMP]
>>>>         CPUS: 1
>>>>         DATE: Wed Nov 10 09:56:34 EST 2021
>>>>       UPTIME: 00:00:42
>>>> LOAD AVERAGE: 2.27, 0.69, 0.24
>>>>        TASKS: 183
>>>>     NODENAME: X
>>>>      RELEASE: 5.15.0+
>>>>      VERSION: #974 SMP Wed Nov 10 04:18:19 CST 2021
>>>>      MACHINE: ppc64le (2500 Mhz)
>>>>       MEMORY: 8 GB
>>>>        PANIC: "Kernel panic - not syncing: sysrq triggered crash"
>>>>          PID: 3394
>>>>      COMMAND: "bash"
>>>>         TASK: c000150a5f80  [THREAD_INFO: c000150a5f80]
>>>>          CPU: 1
>>>>        STATE: TASK_RUNNING (PANIC)
>>>>
>>>> crash> p -x __cpu_online_mask
>>>> __cpu_online_mask = $1 = {
>>>>   bits = {0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
>>>> }
>>>> crash> p -x __cpu_active_mask
>>>> __cpu_active_mask = $2 = {
>>>>   bits = {0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
>>>> }
>>>>
>>>> While this has been the case since fadump was introduced, the issue
>>>> was not identified for two probable reasons:
>>>>
>>>>   - In general, the bulk of the vmcores analyzed were from crash
>>>>     due to exception.
>>>>
>>>>   - The above did change since commit 8341f2f222d7 ("sysrq: Use
>>>>     panic() to force a crash") started using panic() instead of
>>>>     dereferencing a NULL pointer to force a kernel crash.
>>>>
>>>> But then commit de6e5d38417e ("powerpc: smp_send_stop do not
>>>> offline stopped CPUs") stopped marking CPUs as offline till kernel
>>>> commit bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
>>>> reverted that change. To avoid the vmcore from showing only one CPU
>>>> as online in the panic path, skip marking non panic'ing CPUs as
>>>> offline while stopping them during fadump crash.
>>>
>>> Is this really worth the added complexity/bug surface? Why does it
>>> matter if the vmcore shows only one CPU online?
>>
>> We lose out on backtrace/register data of non-crashing CPUs as they
>> are explicitly marked offline. Actually, the state of CPU resources
>> is explicitly changed after the panic though the aim of vmcore is to
>> capture the system state at the time of panic...
>>
>> Alternatively, how about moving the crash_fadump() call from the
>> panic notifier into the panic() function explicitly, just like
>> __crash_kexec() - before the smp_send_stop() call, so as to remove
>> the dependency on the smp_send_stop() implementation altogether...
>
> Does the crash dump code snapshot the CPUs with send_debugger_break?
> I can't remember the exact flow. But if it sends NMI IPIs to all other
> CPUs then maybe it could be moved before smp_send_stop, good idea.

Except for the crashing CPU, a snapshot of register data for all other
CPUs is collected by f/w on the ibm,os-term rtas call.

>>> The comment talks about printk_safe_flush_on_panic(), and this
>>> change would presumably break that. Except that
>>> printk_safe_flush_on_panic() no longer exists. So do we need to set
>>> the CPU online here at all? ie. could we revert bab26238bbd4
>>> ("powerpc: Offline CPU in stop_this_cpu()") now that
>>> printk_safe_flush_on_panic() no longer exists?
>>
>> Yeah, sounds like the logical thing to do but I guess Nick would be
>> in a better position to answer this..
>
> Maybe we could look at reverting it, it would be nice. But I think it
> would be good to consider moving crash_fadump as well. There is at

OK, I will try moving crash_fadump() instead and post the patch..

Thanks,
Hari
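The crash output quoted above shows __cpu_online_mask with bits = {0x2, ...}: each set bit N in a cpumask word means CPU N is online, so 0x2 means only CPU 1 is online, matching the "CPUS: 1" / "CPU: 1" lines in the dump. A small userspace sketch (plainly not kernel code) of decoding one such 64-bit mask word:

```c
/*
 * Decode a cpumask word as printed by crash's "p -x __cpu_online_mask":
 * bit N set means CPU N is online. 0x2 -> only CPU 1 online.
 */
#include <assert.h>

/* Count set bits, i.e. how many CPUs the word marks online. */
static int online_count(unsigned long long word)
{
	int n = 0;

	while (word) {
		n += word & 1;
		word >>= 1;
	}
	return n;
}

/* Lowest online CPU number in the word, or -1 if none. */
static int first_online_cpu(unsigned long long word)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (word & (1ULL << cpu))
			return cpu;
	return -1;
}
```

Applying this to the dump: online_count(0x2) is 1 and first_online_cpu(0x2) is 1, while the active mask word 0xff still reports all eight CPUs, which is exactly the inconsistency the patch is about.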
[PATCH v2 2/2] powerpc/rtas: rtas_busy_delay_time() kernel-doc
Provide API documentation for rtas_busy_delay_time(), explaining why we
return the same value for 9900 and -2.

Signed-off-by: Nathan Lynch
---
 arch/powerpc/kernel/rtas.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index d686834fe7f5..4cfe9f93a9cd 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -492,8 +492,25 @@ int rtas_call(int token, int nargs, int nret, int *outputs, ...)
 }
 EXPORT_SYMBOL(rtas_call);
 
-/* For RTAS_BUSY (-2), delay for 1 millisecond. For an extended busy status
- * code of 990n, perform the hinted delay of 10^n (last digit) milliseconds.
+/**
+ * rtas_busy_delay_time() - From an RTAS status value, calculate the
+ *                          suggested delay time in milliseconds.
+ *
+ * @status: a value returned from rtas_call() or similar APIs which return
+ *          the status of a RTAS function call.
+ *
+ * Context: Any context.
+ *
+ * Return:
+ * * 100000 - If @status is 9905.
+ * * 10000  - If @status is 9904.
+ * * 1000   - If @status is 9903.
+ * * 100    - If @status is 9902.
+ * * 10     - If @status is 9901.
+ * * 1      - If @status is either 9900 or -2. This is "wrong" for -2, but
+ *            some callers depend on this behavior, and the worst outcome
+ *            is that they will delay for longer than necessary.
+ * * 0      - If @status is not a busy or extended delay value.
  */
 unsigned int rtas_busy_delay_time(int status)
 {
-- 
2.31.1
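The documented mapping is simple enough to state as code: an extended busy status 990n suggests a delay of 10^n milliseconds, RTAS_BUSY (-2) maps to 1 ms like 9900, and anything else maps to 0. The following userspace sketch mirrors the documented return values; it is an illustration of the table above, not the kernel's exact implementation:

```c
/*
 * Sketch of rtas_busy_delay_time()'s documented behavior:
 * 990n -> 10^n ms, -2 -> 1 ms, otherwise 0.
 */
#include <assert.h>

#define RTAS_BUSY                (-2)
#define RTAS_EXTENDED_DELAY_MIN  9900
#define RTAS_EXTENDED_DELAY_MAX  9905

static unsigned int busy_delay_time_sketch(int status)
{
	unsigned int ms = 0;

	if (status == RTAS_BUSY) {
		ms = 1;
	} else if (status >= RTAS_EXTENDED_DELAY_MIN &&
		   status <= RTAS_EXTENDED_DELAY_MAX) {
		int order = status - RTAS_EXTENDED_DELAY_MIN;

		/* 10^order milliseconds: 9900 -> 1, 9905 -> 100000 */
		for (ms = 1; order > 0; order--)
			ms *= 10;
	}
	return ms;
}
```

Spelled out this way, the "wrong" case in the kernel-doc is easy to see: -2 and 9900 are indistinguishable to a caller that only looks at the returned delay.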
[PATCH v2 0/2] powerpc/rtas: improved busy and extended delay status handling
This can be considered a successor to:

https://lore.kernel.org/linuxppc-dev/20210504030358.1715034-1-nath...@linux.ibm.com/

which tried to address both the suboptimal delay behavior as well as API
issues. This version achieves the performance improvements and leaves
major API changes for another time.

Changes since v1:
* Drop major API changes.
* Avoid division when calculating the arguments to usleep_range() (Alexey).
* Improve kernel-doc for rtas_busy_delay(), rtas_busy_delay_time().

Nathan Lynch (2):
  powerpc/rtas: rtas_busy_delay() improvements
  powerpc/rtas: rtas_busy_delay_time() kernel-doc

 arch/powerpc/include/asm/rtas.h |  2 +-
 arch/powerpc/kernel/rtas.c      | 95 +
 2 files changed, 87 insertions(+), 10 deletions(-)

-- 
2.31.1
[PATCH v2 1/2] powerpc/rtas: rtas_busy_delay() improvements
Generally RTAS cannot block, and in PAPR it is required to return control
to the OS within a few tens of microseconds. In order to support
operations which may take longer to complete, many RTAS primitives can
return intermediate -2 ("busy") or 990x ("extended delay") values, which
indicate that the OS should reattempt the same call with the same
arguments at some point in the future.

Current versions of PAPR are less than clear about this, but the intended
meanings of these values in more detail are:

RTAS_BUSY (-2): RTAS has suspended a potentially long-running operation
in order to meet its latency obligation and give the OS the opportunity
to perform other work. RTAS can resume making progress as soon as the OS
reattempts the call.

RTAS_EXTENDED_DELAY_{MIN...MAX} (9900-9905): RTAS must wait for an
external event to occur or for internal contention to resolve before it
can complete the requested operation. The value encodes a non-binding
hint as to roughly how long the OS should wait before calling again, but
the OS is allowed to reattempt the call sooner or even immediately.

Linux of course must take its own CPU scheduling obligations into account
when handling these statuses; e.g. a task which receives an RTAS_BUSY
status should check whether to reschedule before it attempts the RTAS
call again to avoid starving other tasks.

rtas_busy_delay() is a helper function that "consumes" a busy or extended
delay status. Common usage:

	int rc;

	do {
		rc = rtas_call(rtas_token("some-function"), ...);
	} while (rtas_busy_delay(rc));

	/* convert rc to Linux error value, etc */

If rc is a busy or extended delay status, the caller can rely on
rtas_busy_delay() to perform an appropriate sleep or reschedule and
return nonzero. Other statuses are handled normally by the caller.
The current implementation of rtas_busy_delay() both oversleeps and
overuses the CPU:

* It performs msleep() for all 990x and even when no delay is suggested
  (-2), but this is understood to actually sleep for two jiffies minimum
  in practice (20ms with HZ=100). 9900 (1ms) and 9901 (10ms) appear to
  be the most common extended delay statuses, and the oversleeping
  measurably lengthens DLPAR operations, which perform many RTAS calls.

* It does not sleep on 990x unless need_resched() is true, causing code
  like the loop above to needlessly retry, wasting CPU time.

Alter the logic to align better with the intended meanings:

* When passed RTAS_BUSY, perform cond_resched() and return without
  sleeping. The caller should reattempt immediately.

* Always sleep when passed an extended delay status, using
  usleep_range() for precise shorter sleeps. Limit the sleep time to one
  second even though there are higher architected values.

Change rtas_busy_delay()'s return type to bool to better reflect its
usage, and add kernel-doc.

rtas_busy_delay_time() is unchanged, even though it "incorrectly"
returns 1 for RTAS_BUSY. There are users of that API with open-coded
delay loops in sensitive contexts that will have to be taken on an
individual basis.

Brief results for addition and removal of 5GB memory on a small P9
PowerVM partition follow. Load was generated with stress-ng --cpu N. For
add, elapsed time is greatly reduced without significant change in the
number of RTAS calls or time spent on CPU. For remove, elapsed time is
modestly reduced, with significant reductions in RTAS calls and time
spent on CPU.
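The altered policy described above can be summarized as: RTAS_BUSY means "reschedule if needed, then retry immediately"; an extended delay status means "always sleep for the hinted time, clamped to one second"; any other status ends the retry loop. A userspace sketch of that decision logic, with the actual cond_resched()/usleep_range() calls replaced by a recorded decision (all names here are illustrative, not the kernel's):

```c
/*
 * Sketch of the new rtas_busy_delay() policy: -2 -> no sleep, just a
 * reschedule point; 990x -> always sleep 10^x ms, clamped to 1000 ms;
 * anything else -> done.
 */
#include <assert.h>
#include <stdbool.h>

enum action { ACT_NONE, ACT_RESCHED, ACT_SLEEP };

struct decision {
	enum action act;
	unsigned int sleep_ms;   /* valid when act == ACT_SLEEP */
};

static struct decision busy_delay_sketch(int status)
{
	struct decision d = { ACT_NONE, 0 };

	if (status == -2) {                      /* RTAS_BUSY */
		d.act = ACT_RESCHED;             /* caller retries at once */
	} else if (status >= 9900 && status <= 9905) {
		unsigned int ms = 1;

		for (int n = status - 9900; n > 0; n--)
			ms *= 10;
		if (ms > 1000)                   /* clamp to one second */
			ms = 1000;
		d.act = ACT_SLEEP;
		d.sleep_ms = ms;
	}
	return d;
}

/* Nonzero means "handled, call RTAS again", as in the usage loop above. */
static bool should_retry(int status)
{
	return busy_delay_sketch(status).act != ACT_NONE;
}
```

This makes the two fixes visible side by side: the -2 path no longer sleeps at all, and the 990x path always sleeps instead of only when need_resched() happens to be true.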
With no competing workload (- before, + after):

 Performance counter stats for 'bash -c echo "memory add count 20" > /sys/kernel/dlpar' (10 runs):

-         1,935      probe:rtas_call    #  0.003 M/sec           ( +- 0.22% )
-        609.99 msec task-clock         #  0.183 CPUs utilized   ( +- 0.19% )
+         1,956      probe:rtas_call    #  0.003 M/sec           ( +- 0.17% )
+        618.56 msec task-clock         #  0.278 CPUs utilized   ( +- 0.11% )

-        3.3322 +- 0.0670 seconds time elapsed  ( +- 2.01% )
+        2. +- 0.0416 seconds time elapsed  ( +- 1.87% )

 Performance counter stats for 'bash -c echo "memory remove count 20" > /sys/kernel/dlpar' (10 runs):

-         6,224      probe:rtas_call    #  0.008 M/sec           ( +- 2.57% )
-        750.36 msec task-clock         #  0.190 CPUs utilized   ( +- 2.01% )
+           843      probe:rtas_call    #  0.003 M/sec           ( +- 0.12% )
+        250.66 msec task-clock         #  0.068 CPUs utilized   ( +- 0.17% )

-        3.9394 +- 0.0890 seconds time elapsed  ( +- 2.26% )
+         3.678 +- 0.113 seconds time elapsed  ( +- 3.07% )

With all CPUs 100% busy (- before, + after):

 Performance counter stats for 'bash -c echo "memory add count 20" > /sys/kernel/dlpar' (10 runs):

-         2,979
Re: [PATCH 12/13] sysctl: add helper to register empty subdir
On Fri, May 29, 2020 at 08:03:02AM -0500, Eric W. Biederman wrote:
> Luis Chamberlain writes:
>
>> The way to create a subdirectory from the base set of directories
>> is a bit obscure, so provide a helper which makes this clear, and
>> also helps remove boiler plate code required to do this work.
>
> I agree calling:
>
>	register_sysctl("fs/binfmt_misc", sysctl_mount_point)
>
> is a bit obscure but if you are going to make a wrapper
> please make it the trivial one liner above.
>
> Say something that looks like:
>
>	struct sysctl_header *register_sysctl_mount_point(const char *path)
>	{
>		return register_sysctl(path, sysctl_mount_point);
>	}
>
> And yes please talk about a mount point and not an empty dir, as these
> are permanently empty directories to serve as mount points. There are
> some subtle but important permission checks this allows in the case of
> unprivileged mounts.
>
> Further, code like this belongs in proc_sysctl.c next to all of the
> code it is related to so that it is easier to see how to refactor the
> code if necessary.

Alrighty, it's been a while since this kernel/sysctl.c kitchen sink
cleanup... so it's time to respin this now that the merge window is
open. I have already rebased the patches and addressed all input, and
now I am just waiting to fix any compilation errors.

I'm going to split the patches up into really small sets so as to
ensure we get this through, because getting this in otherwise is going
to be hard. I'd appreciate folks' review once the patches start going
out.

I think a hard part will be deciding what tree this should go through.

  Luis
Re: [RFC PATCH 0/3] Use pageblock_order for cma and alloc_contig_range alignment.
On 16 Nov 2021, at 3:58, David Hildenbrand wrote:

> On 15.11.21 20:37, Zi Yan wrote:
>> From: Zi Yan
>>
>> Hi David,
>
> Hi,
>
> thanks for looking into this.
>
>> You suggested to make alloc_contig_range() deal with pageblock_order
>> instead of MAX_ORDER - 1 and get rid of the MAX_ORDER - 1 dependency
>> in virtio_mem[1]. This patchset is my attempt to achieve that. Please
>> take a look and let me know if I am doing it correctly or not.
>>
>> From my understanding, cma required alignment of
>> max(MAX_ORDER - 1, pageblock_order), because when MIGRATE_CMA was
>> introduced, __free_one_page() did not prevent merging two different
>> pageblocks when MAX_ORDER - 1 > pageblock_order. But the current
>> __free_one_page() implementation does prevent that. It should be OK
>> to just align cma to pageblock_order. alloc_contig_range() relies on
>> MIGRATE_CMA to get free pages, so it can use pageblock_order as
>> alignment too.
>
> I wonder if that's sufficient. Especially the outer_start logic in
> alloc_contig_range() might be problematic. There are some ugly corner
> cases with free pages/allocations spanning multiple pageblocks and we
> only isolated a single pageblock.

Thank you a lot for writing the list of these corner cases. They are
very helpful!

> Regarding CMA, we have to keep the following cases working:
>
> a) Different pageblock types (MIGRATE_CMA and !MIGRATE_CMA) in a
>    MAX_ORDER - 1 page:
>    [       MAX_ORDER - 1       ]
>    [ pageblock 0 | pageblock 1 ]
>
> Assume either pageblock 0 is MIGRATE_CMA or pageblock 1 is
> MIGRATE_CMA, but not both. We have to make sure alloc_contig_range()
> keeps working correctly. This should be the case even with your
> change, as we won't merge pages across differing migratetypes.

Yes.

> b) Migrating/freeing a MAX_ORDER - 1 page while partially isolated:
>    [       MAX_ORDER - 1       ]
>    [ pageblock 0 | pageblock 1 ]
>
> Assume both are MIGRATE_CMA. Assume we want to either allocate from
> pageblock 0 or pageblock 1. Especially, assume we want to allocate
> from pageblock 1. While we would isolate pageblock 1, we wouldn't
> isolate pageblock 0.
>
> What happens if we either have a free page spanning the MAX_ORDER - 1
> range already OR if we have to migrate a MAX_ORDER - 1 page, resulting
> in a free MAX_ORDER - 1 page of which only the second pageblock is
> isolated? We would end up essentially freeing a page that has mixed
> pageblocks, essentially placing it in !MIGRATE_ISOLATE free lists ...
> I might be wrong but I have the feeling that this would be
> problematic.

This could happen when start_isolate_page_range() stumbles upon a
compound page with order >= pageblock_order or a free page with
order >= pageblock_order, but should not. start_isolate_page_range()
should check the actual page size, either compound page size or free
page size, and set the migratetype across pageblocks if the page is
bigger than pageblock size. More precisely, set_migratetype_isolate()
should do that.

> c) Concurrent allocations:
>    [       MAX_ORDER - 1       ]
>    [ pageblock 0 | pageblock 1 ]
>
> Assume b) but we have two concurrent CMA allocations to pageblock 0
> and pageblock 1, which would now be possible as
> start_isolate_page_range() isolate would succeed on both.

Two isolations will be serialized by the zone lock taken by
set_migratetype_isolate(), so the concurrent allocation would not be a
problem. If it is a MAX_ORDER-1 free page, the first comer should split
it and only isolate one of the pageblocks, then the second one can
isolate the other pageblock. If it is a MAX_ORDER-1 compound page, the
first comer should isolate both pageblocks, then the second one would
fail. WDYT?

In sum, it seems to me that the issue is that the page isolation code
only sees pageblocks without checking the actual page. When there are
multiple pageblocks belonging to one page, the problem appears. This
should be fixed.

> Regarding virtio-mem, we care about the following cases:
>
> a) Allocating parts from a completely movable MAX_ORDER - 1 page:
>    [       MAX_ORDER - 1       ]
>    [ pageblock 0 | pageblock 1 ]
>
> Assume pageblock 0 and pageblock 1 are either free or contain only
> movable pages. Assume we allocated pageblock 0. We have to make sure
> we can allocate pageblock 1. The other way around, assume we allocated
> pageblock 1, we have to make sure we can allocate pageblock 0.
>
> Free pages spanning both pageblocks might be problematic.

Can you elaborate a bit? If either of pageblock 0 and 1 is used by
virtio-mem, why do we care about the other? If pageblock 0 and 1 belong
to the same page (either free or compound), they should have the same
migratetype. If we want to just allocate one of them, we can split the
free page or migrate the compound page then split the remaining free
page.

> b) Allocate parts of a partially movable MAX_ORDER - 1 page:
>    [       MAX_ORDER - 1       ]
>    [ pageblock 0 | pageblock 1 ]
>
> Assume pageblock 0 contains unmovable data but
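The geometry underlying all of the cases above is that when pageblock_order < MAX_ORDER - 1, a single MAX_ORDER - 1 buddy page covers more than one pageblock, so isolating one pageblock can leave part of such a page unisolated. A tiny sketch of that arithmetic, using illustrative order values (pageblock_order = 9, MAX_ORDER - 1 = 10; these are assumptions for the example, not taken from any particular config):

```c
/*
 * Sketch of the pageblock geometry under discussion: a buddy page of
 * order > pageblock_order spans multiple pageblocks.
 */
#include <assert.h>

#define PAGEBLOCK_ORDER 9   /* illustrative, e.g. huge page order */
#define MAX_ORDER_M1    10  /* illustrative MAX_ORDER - 1 */

/* Which pageblock does this pfn fall into? */
static unsigned long pfn_to_pageblock(unsigned long pfn)
{
	return pfn >> PAGEBLOCK_ORDER;
}

/* How many pageblocks does a free page of the given order span? */
static unsigned long pageblocks_spanned(unsigned int order)
{
	if (order <= PAGEBLOCK_ORDER)
		return 1;
	return 1UL << (order - PAGEBLOCK_ORDER);
}
```

With these values a MAX_ORDER - 1 page spans two pageblocks (the "[ pageblock 0 | pageblock 1 ]" diagrams above), which is exactly why isolating a single pageblock is not enough unless the isolation code looks at the actual page size.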
Re: [PATCH] powerpc/pseries: delete scanlog
Nathan Lynch writes:
> Remove the pseries scanlog driver.
>
> This code supports functions from Power4-era servers that are not
> present on targets currently supported by arch/powerpc. System manuals
> from this time have this description:
>
>     Scan Dump data is a set of chip data that the service processor
>     gathers after a system malfunction. It consists of chip scan rings,
>     chip trace arrays, and Scan COM (SCOM) registers. This data is
>     stored in the scan-log partition of the system's Nonvolatile Random
>     Access Memory (NVRAM).
>
> PowerVM partition firmware development doesn't recognize the associated
> function call or property, and they don't see any references to them in
> their codebase. It seems to have been specific to non-virtualized
> pseries.

Just bumping this to see if there are any objections.
Re: Build regressions/improvements in v5.16-rc1
On 11/16/21 5:59 PM, Nick Terrell wrote:
>> On Nov 15, 2021, at 8:44 AM, Helge Deller wrote:
>>
>> On 11/15/21 17:12, Geert Uytterhoeven wrote:
>>> On Mon, Nov 15, 2021 at 4:54 PM Geert Uytterhoeven wrote:
>>>> Below is the list of build error/warning regressions/improvements
>>>> in v5.16-rc1[1] compared to v5.15[2].
>>>>
>>>> Summarized:
>>>>   - build errors: +20/-13
>>>>   - build warnings: +3/-28
>>>>
>>>> Happy fixing! ;-)
>>>>
>>>> Thanks to the linux-next team for providing the build service.
>>>>
>>>> [1] http://kisskb.ellerman.id.au/kisskb/branch/linus/head/fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf/ (all 90 configs)
>>>> [2] http://kisskb.ellerman.id.au/kisskb/branch/linus/head/8bb7eca972ad531c9b149c0a51ab43a417385813/ (all 90 configs)
>>>>
>>>> *** ERRORS ***
>>>>
>>>> 20 error regressions:
>>>>   + /kisskb/src/arch/parisc/include/asm/jump_label.h: error: expected ':' before '__stringify': => 33:4, 18:4
>>>>   + /kisskb/src/arch/parisc/include/asm/jump_label.h: error: label 'l_yes' defined but not used [-Werror=unused-label]: => 38:1, 23:1
>>>
>>> due to static_branch_likely() in crypto/api.c
>>> parisc-allmodconfig
>>
>> fixed now in the parisc for-next git tree.
>>
>>>>   + /kisskb/src/drivers/gpu/drm/msm/msm_drv.h: error: "COND" redefined [-Werror]: => 531
>>>>   + /kisskb/src/lib/zstd/compress/zstd_double_fast.c: error: the frame size of 3252 bytes is larger than 1536 bytes [-Werror=frame-larger-than=]: => 47:1
>>>>   + /kisskb/src/lib/zstd/compress/zstd_double_fast.c: error: the frame size of 3360 bytes is larger than 1536 bytes [-Werror=frame-larger-than=]: => 499:1
>>>>   + /kisskb/src/lib/zstd/compress/zstd_double_fast.c: error: the frame size of 5344 bytes is larger than 1536 bytes [-Werror=frame-larger-than=]: => 334:1
>>>>   + /kisskb/src/lib/zstd/compress/zstd_double_fast.c: error: the frame size of 5380 bytes is larger than 1536 bytes [-Werror=frame-larger-than=]: => 354:1
>>>>   + /kisskb/src/lib/zstd/compress/zstd_fast.c: error: the frame size of 1824 bytes is larger than 1536 bytes [-Werror=frame-larger-than=]: => 372:1
>>>>   + /kisskb/src/lib/zstd/compress/zstd_fast.c: error: the frame size of 2224 bytes is larger than 1536 bytes [-Werror=frame-larger-than=]: => 204:1
>>>>   + /kisskb/src/lib/zstd/compress/zstd_fast.c: error: the frame size of 3800 bytes is larger than 1536 bytes [-Werror=frame-larger-than=]: => 476:1
>>>
>>> parisc-allmodconfig
>>
>> parisc needs much bigger frame sizes, so I'm not astonished here.
>> During the v5.15 cycle I increased it to 1536 (from 1280), so I'm
>> simply tempted to increase it this time to 4096, unless someone has a
>> better idea.
>
> This patch set should fix the zstd stack size warnings [0]. I've
> verified the fix using the same tooling: gcc-8-hppa-linux-gnu. I'll
> send the PR to Linus tomorrow.
>
> I've been informed that it isn't strictly necessary to send the
> patches to the mailing list for bug fixes, but it's already done, so
> I'll wait and see if there is any feedback.

IMO several (or many more) people would disagree with that. "strictly?"
OK, it's probably possible that almost any patch could be merged without
being on a mailing list, but it's not desirable (except in the case of
"security" patches).

-- 
~Randy
[powerpc:fixes-test] BUILD SUCCESS 1e35eba4055149c578baf0318d2f2f89ea3c44a0
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git fixes-test
branch HEAD: 1e35eba4055149c578baf0318d2f2f89ea3c44a0  powerpc/8xx: Fix pinned TLBs with CONFIG_STRICT_KERNEL_RWX

elapsed time: 724m

configs tested: 190
configs skipped: 104

The following configs have been built successfully.
More configs may be tested in the coming days.

gcc tested configs:
arm          defconfig
arm64        allyesconfig
arm64        defconfig
arm          allyesconfig
arm          allmodconfig
i386         randconfig-c001-2026
mips         randconfig-c004-2026
powerpc      mpc836x_rdk_defconfig
arm          pxa_defconfig
powerpc      mpc8313_rdb_defconfig
sh           r7785rp_defconfig
s390         allmodconfig
microblaze   mmu_defconfig
mips         tb0287_defconfig
sh           sh7724_generic_defconfig
m68k         m5249evb_defconfig
x86_64       defconfig
arm          cerfcube_defconfig
powerpc      eiger_defconfig
mips         loongson2k_defconfig
powerpc      socrates_defconfig
xtensa       common_defconfig
mips         decstation_64_defconfig
m68k         q40_defconfig
arc          hsdk_defconfig
arm          s5pv210_defconfig
xtensa       iss_defconfig
powerpc      mpc837x_rdb_defconfig
powerpc      ps3_defconfig
sh           rts7751r2d1_defconfig
sh           titan_defconfig
um           i386_defconfig
ia64         allyesconfig
sh           hp6xx_defconfig
powerpc      ppc44x_defconfig
powerpc      mpc85xx_cds_defconfig
s390         zfcpdump_defconfig
arm          mini2440_defconfig
microblaze   defconfig
sh           landisk_defconfig
arm          at91_dt_defconfig
m68k         m5407c3_defconfig
openrisc     defconfig
mips         decstation_r4k_defconfig
powerpc      pq2fads_defconfig
powerpc      walnut_defconfig
arm          mxs_defconfig
mips         ath79_defconfig
powerpc      makalu_defconfig
mips         sb1250_swarm_defconfig
arm          clps711x_defconfig
arm          socfpga_defconfig
powerpc      mpc834x_itxgp_defconfig
mips         rs90_defconfig
arm          exynos_defconfig
powerpc      cell_defconfig
arm          axm55xx_defconfig
arm          sama5_defconfig
mips         rm200_defconfig
arm          stm32_defconfig
arm          spear13xx_defconfig
ia64         alldefconfig
mips         malta_defconfig
arm          versatile_defconfig
arm          gemini_defconfig
arm          eseries_pxa_defconfig
arm          u8500_defconfig
mips         cobalt_defconfig
sh           secureedge5410_defconfig
powerpc      bamboo_defconfig
h8300        defconfig
arm          aspeed_g4_defconfig
arm          imx_v6_v7_defconfig
powerpc      mpc832x_mds_defconfig
s390         alldefconfig
arm          imx_v4_v5_defconfig
sh           apsh4a3a_defconfig
arm          hisi_defconfig
mips         rb532_defconfig
ia64         defconfig
arm          magician_defconfig
powerpc      mpc7448_hpc2_defconfig
arm          collie_defconfig
powerpc      ep88xc_defconfig
arc          tb10x_defconfig
arm          dove_defconfig
nios2        3c120_defconfig
arm          mainstone_defconfig
powerpc      wii_defconfig
powerpc      mpc512x_defconfig
sh           kfr2r09-romimage_defconfig
sh           rsk7264_defconfig
m68k         apollo_defconfig
sh           shmin_defconfig
h8300        alldefconfig
mips         rt305x_defconfig
arm          mps2_defconfig
xtensa       cadence_csp_defconfig
sh           ap325rxa_defconfig
m68k         amcore_defconfig
arm
[powerpc:fixes] BUILD SUCCESS 302039466f6a3b9421ecb9a6a2c528801dc24a86
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git fixes
branch HEAD: 302039466f6a3b9421ecb9a6a2c528801dc24a86  powerpc/pseries: Fix numa FORM2 parsing fallback code

elapsed time: 725m

configs tested: 153
configs skipped: 108

The following configs have been built successfully.
More configs may be tested in the coming days.

gcc tested configs:
arm          defconfig
arm64        allyesconfig
arm64        defconfig
arm          allyesconfig
arm          allmodconfig
i386         randconfig-c001-2026
mips         randconfig-c004-2026
powerpc      mpc836x_rdk_defconfig
arm          pxa_defconfig
powerpc      mpc8313_rdb_defconfig
sh           r7785rp_defconfig
s390         allmodconfig
microblaze   mmu_defconfig
mips         tb0287_defconfig
sh           sh7724_generic_defconfig
m68k         m5249evb_defconfig
arm          cerfcube_defconfig
powerpc      eiger_defconfig
m68k         q40_defconfig
arc          hsdk_defconfig
arm          s5pv210_defconfig
xtensa       iss_defconfig
powerpc      mpc837x_rdb_defconfig
powerpc      ps3_defconfig
sh           rts7751r2d1_defconfig
sh           titan_defconfig
um           i386_defconfig
ia64         allyesconfig
s390         zfcpdump_defconfig
arm          mini2440_defconfig
microblaze   defconfig
sh           landisk_defconfig
arm          at91_dt_defconfig
m68k         m5407c3_defconfig
openrisc     defconfig
mips         decstation_r4k_defconfig
arm          axm55xx_defconfig
arm          sama5_defconfig
mips         rm200_defconfig
arm          stm32_defconfig
arm          spear13xx_defconfig
ia64         alldefconfig
powerpc      mpc834x_itxgp_defconfig
mips         malta_defconfig
arm          versatile_defconfig
arm          imx_v4_v5_defconfig
sh           apsh4a3a_defconfig
arm          hisi_defconfig
mips         rb532_defconfig
ia64         defconfig
arm          magician_defconfig
powerpc      mpc7448_hpc2_defconfig
mips         ath79_defconfig
arm          collie_defconfig
nios2        3c120_defconfig
arm          mainstone_defconfig
powerpc      wii_defconfig
powerpc      mpc512x_defconfig
powerpc      mpc85xx_cds_defconfig
sh           kfr2r09-romimage_defconfig
sh           rsk7264_defconfig
m68k         apollo_defconfig
sh           shmin_defconfig
h8300        alldefconfig
mips         rt305x_defconfig
powerpc      mpc832x_mds_defconfig
arm          mps2_defconfig
xtensa       cadence_csp_defconfig
sh           ap325rxa_defconfig
m68k         amcore_defconfig
arm          mvebu_v7_defconfig
sh           dreamcast_defconfig
arm          vt8500_v6_v7_defconfig
arc          nsim_700_defconfig
arm          milbeaut_m10v_defconfig
sh           sh7770_generic_defconfig
xtensa       virt_defconfig
ia64         tiger_defconfig
arm          randconfig-c002-2026
ia64         allmodconfig
m68k         allmodconfig
m68k         defconfig
m68k         allyesconfig
nios2        defconfig
arc          allyesconfig
nds32        allnoconfig
nds32        defconfig
nios2        allyesconfig
csky         defconfig
alpha        defconfig
alpha        allyesconfig
xtensa       allyesconfig
h8300        allyesconfig
arc          defconfig
sh           allmodconfig
parisc       defconfig
s390         allyesconfig
parisc       allyesconfig
s390         defconfig
i386         allyesconfig
sparc
[powerpc:next-test] BUILD SUCCESS 098a78a0f7f776200f1c5694983a37b77c522dbb

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next-test
branch HEAD: 098a78a0f7f776200f1c5694983a37b77c522dbb  powerpc/code-patching: Improve verification of patchability

elapsed time: 721m
configs tested: 195
configs skipped: 4

The following configs have been built successfully.
More configs may be tested in the coming days.

gcc tested configs:
(the two-column arch/config table — entries such as "x86_64 defconfig", "arm64 allyesconfig", "powerpc ps3_defconfig" — was flattened during extraction and is omitted here; the list is truncated mid-entry in the archive)
[powerpc:merge] BUILD SUCCESS 64044d20a2d777d2346cde7eb41d17497c4f4449

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git merge
branch HEAD: 64044d20a2d777d2346cde7eb41d17497c4f4449  Automatic merge of 'fixes' into merge (2021-11-16 22:46)

elapsed time: 721m
configs tested: 140
configs skipped: 3

The following configs have been built successfully.
More configs may be tested in the coming days.

gcc tested configs:
(the two-column arch/config table — entries such as "arm64 allyesconfig", "powerpc pseries_defconfig", "s390 zfcpdump_defconfig" — was flattened during extraction and is omitted here; the list is truncated mid-entry in the archive)
[PATCH] powerpc/rtas: kernel-doc fixes
Fix the following issues reported by kernel-doc:

$ scripts/kernel-doc -v -none arch/powerpc/kernel/rtas.c
arch/powerpc/kernel/rtas.c:810: info: Scanning doc for function rtas_activate_firmware
arch/powerpc/kernel/rtas.c:818: warning: contents before sections
arch/powerpc/kernel/rtas.c:841: info: Scanning doc for function rtas_call_reentrant
arch/powerpc/kernel/rtas.c:893: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
 * Find a specific pseries error log in an RTAS extended event log.

Signed-off-by: Nathan Lynch
---
 arch/powerpc/kernel/rtas.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index ff80bbad22a5..ca27421f471a 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -809,13 +809,13 @@ void rtas_os_term(char *str)
 /**
  * rtas_activate_firmware() - Activate a new version of firmware.
  *
+ * Context: This function may sleep.
+ *
  * Activate a new version of partition firmware. The OS must call this
  * after resuming from a partition hibernation or migration in order
  * to maintain the ability to perform live firmware updates. It's not
  * catastrophic for this method to be absent or to fail; just log the
  * condition in that case.
- *
- * Context: This function may sleep.
  */
 void rtas_activate_firmware(void)
 {
@@ -890,11 +890,12 @@ int rtas_call_reentrant(int token, int nargs, int nret, int *outputs, ...)
 #endif /* CONFIG_PPC_PSERIES */
 
 /**
- * Find a specific pseries error log in an RTAS extended event log.
+ * get_pseries_errorlog() - Find a specific pseries error log in an RTAS
+ *			    extended event log.
  * @log: RTAS error/event log
  * @section_id: two character section identifier
  *
- * Returns a pointer to the specified errorlog or NULL if not found.
+ * Return: A pointer to the specified errorlog or NULL if not found.
  */
 struct pseries_errorlog *get_pseries_errorlog(struct rtas_error_log *log,
					      uint16_t section_id)
--
2.31.1
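For reference, the comment shape the patch converts to can be illustrated with a hypothetical function. Everything below is made up for illustration (find_section() is not from rtas.c): a "name() - summary" line, @parameter descriptions, a free-form description, and named sections such as Context: and Return: — see Documentation/doc-guide/kernel-doc.rst for the exact rules.

```c
#include <assert.h>
#include <stddef.h>

/**
 * find_section() - Find a section by its two-character identifier.
 * @ids: array of section identifiers
 * @n: number of entries in @ids
 * @id: identifier to look for
 *
 * Free-form description text. It follows the parameter list, and the
 * named sections below it are what kernel-doc emits as "Context" and
 * "Return" in the rendered documentation.
 *
 * Context: Any context.
 * Return: Index of the matching section, or -1 if not found.
 */
static int find_section(const unsigned short *ids, size_t n, unsigned short id)
{
	for (size_t i = 0; i < n; i++)
		if (ids[i] == id)
			return (int)i;
	return -1;
}
```

Running `scripts/kernel-doc -v -none` over a file containing a comment in this shape should produce no warnings, which is what the patch above achieves for rtas.c.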
RE: bug: usb: gadget: FSL_UDC_CORE Corrupted request list leads to unrecoverable loop.
On 02.11.21 22:15, Joakim Tjernlund wrote:
> On Sat, 2021-10-30 at 14:20 +, Joakim Tjernlund wrote:
>> On Fri, 2021-10-29 at 17:14 +, Eugene Bordenkircher wrote:
>
>>> We've discovered a situation where the FSL udc driver (drivers/usb/gadget/udc/fsl_udc_core.c) will enter a loop iterating over the request queue, but the queue has been corrupted at some point so it loops infinitely. I believe we have narrowed in on the offending code, but we are in need of assistance trying to find an appropriate fix for the problem. The identified code appears to be in all versions of the Linux kernel the driver exists in.
>>>
>>> The problem appears to be when handling a USB_REQ_GET_STATUS request. The driver gets this request and then calls the ch9getstatus() function. In this function, it starts a request by "borrowing" the per-device status_req, filling it in, and then queuing it with a call to list_add_tail() to add the request to the endpoint queue. Right before it exits the function however, it's calling ep0_prime_status(), which is filling out that same status_req structure and then queuing it with another call to list_add_tail() to add the request to the endpoint queue. This adds two instances of the exact same LIST_HEAD to the endpoint queue, which breaks the list since the prev and next pointers end up pointing to the wrong things. This ends up causing a hard loop the next time nuke() gets called, which happens on the next setup IRQ.
>>>
>>> I'm not sure what the appropriate fix to this problem is, mostly due to my lack of expertise in USB and this driver stack. The code has been this way in the kernel for a very long time, which suggests that it has been working, unless USB_REQ_GET_STATUS requests are never made. This further suggests that there is something else going on that I don't understand.
>>> Deleting the call to ep0_prime_status() and the following ep0stall() call appears, on the surface, to get the device working again, but may have side effects that I'm not seeing.
>>>
>>> I'm hopeful someone in the community can help provide some information on what I may be missing or help come up with a solution to the problem. A big thank you to anyone who would like to help out.
>>>
>>> Eugene
>>
>> Ran into this a while ago. Found the bug and a few more fixes.
>> This is against 4.19 so you may have to tweak them a bit.
>> Feel free to upstream them.
>>
>> Jocke
>
> Curious, did my patches help? Good to know once we upgrade as well.
>
> Jocke

There's good news and bad news.

The good news is that this appears to stop the driver from entering an infinite loop, which prevents the Linux system from locking up and never recovering. So I'm willing to say we've made the behavior better.

The bad news is that once we get past this point, there is new bad behavior. What is on top of this driver in our system is the RNDIS gadget driver communicating with a laptop running Win10 1809. Everything appears to work fine with the Linux system until there is a USB disconnect. After the disconnect, the Linux side appears to continue on just fine, but the Windows side doesn't seem to recognize the disconnect, which causes the USB driver on that side to hang forever and eventually blue screen the box. This doesn't happen on all machines, just a select few. I think we can isolate the behavior to a specific antivirus/security software driver that is inserting itself into the USB stack and filtering the disconnect message, but we're still proving that.

I'm about 90% certain this is a different problem and we can call this patchset good, at least for our test setup. My only hesitation is whether the Linux side is sending a set of responses that are confusing the Windows side (specifically this antivirus) or not.
I'd be content calling that a separate defect though and letting this one close up with that patchset.

Eugene
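To see concretely why queuing the same status_req twice corrupts the endpoint queue, here is a minimal userspace sketch of the kernel's circular doubly-linked list (simplified from include/linux/list.h; count_nodes() is a made-up helper standing in for the traversal that nuke() performs — this is not the driver's actual code):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's struct list_head. */
struct list_head { struct list_head *next, *prev; };

static void init_list_head(struct list_head *h) { h->next = h; h->prev = h; }

/* Same algorithm as the kernel's list_add_tail(): insert 'new' just
 * before 'head' in the circular list. */
static void list_add_tail(struct list_head *new, struct list_head *head)
{
	struct list_head *prev = head->prev;

	new->next = head;
	new->prev = prev;
	prev->next = new;
	head->prev = new;
}

/* Walk forward from head; return the number of nodes seen, or -1 if we
 * never get back to head within 'limit' steps (list is corrupted). */
static int count_nodes(struct list_head *head, int limit)
{
	int n = 0;

	for (struct list_head *p = head->next; p != head; p = p->next)
		if (++n > limit)
			return -1;
	return n;
}
```

Adding the same node a second time makes list_add_tail() see the node itself as head->prev, so `prev->next = new` rewrites the node's own next pointer to point at itself. A traversal from the queue head then reaches the node and spins on it forever without ever returning to head — exactly the unrecoverable loop described above.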
Re: [PATCH 6/7] KVM: powerpc: Use Makefile.kvm for common files
On Tue, 2021-11-16 at 18:43 +, Sean Christopherson wrote:
> On Tue, Nov 16, 2021, David Woodhouse wrote:
> > From: David Woodhouse
> >
> > It's all fairly baroque but in the end, I don't think there's any reason
> > for $(KVM)/irqchip.o to have been handled differently, as they all end
> > up in $(kvm-y) in the end anyway, regardless of whether they get there
> > via $(common-objs-y) and the CPU-specific object lists.
> >
> > The generic Makefile.kvm uses HAVE_KVM_IRQCHIP for irqchip.o instead of
> > HAVE_KVM_IRQ_ROUTING. That change is fine (and arguably correct) because
> > they are both set together for KVM_MPIC, or neither is set.
>
> Nope.
>
> Symbol: HAVE_KVM_IRQCHIP [=y]
> Type  : bool
> Defined at virt/kvm/Kconfig:7
>   Selected by [m]:
>   - KVM_XICS [=y] && VIRTUALIZATION [=y] && KVM_BOOK3S_64 [=m] && !KVM_MPIC [=n]
>   Selected by [n]:
>   - KVM_MPIC [=n] && VIRTUALIZATION [=y] && KVM [=y] && E500 [=n]
>
> leads to this and a whole pile of other errors
>
> arch/powerpc/kvm/../../../virt/kvm/irqchip.c: In function ‘kvm_irq_map_gsi’:
> arch/powerpc/kvm/../../../virt/kvm/irqchip.c:31:35: error: invalid use of undefined type ‘struct kvm_irq_routing_table’
>    31 |         if (irq_rt && gsi < irq_rt->nr_rt_entries) {
>       |                                   ^~

Hm, perhaps it should have been like this then (incremental):

+++ b/virt/kvm/Makefile.kvm
@@ -9,6 +9,6 @@ kvm-y := $(KVM)/kvm_main.o $(KVM)/eventfd.o $(KVM)/binary_stats.o
 kvm-$(CONFIG_KVM_VFIO) += $(KVM)/vfio.o
 kvm-$(CONFIG_KVM_MMIO) += $(KVM)/coalesced_mmio.o
 kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
-kvm-$(CONFIG_HAVE_KVM_IRQCHIP) += $(KVM)/irqchip.o
+kvm-$(CONFIG_HAVE_KVM_IRQ_ROUTING) += $(KVM)/irqchip.o
 kvm-$(CONFIG_HAVE_KVM_DIRTY_RING) += $(KVM)/dirty_ring.o
 kvm-$(CONFIG_HAVE_KVM_PFNCACHE) += $(KVM)/pfncache.o

> Side topic, please don't post a new version/series in-reply-to a different
> series.
> b4 also gets confused in this case, e.g. it tried to grab the original patch.
> b4 has also made me really lazy, heaven forbid I actually had to manually
> grab these from mutt :-)

Sorry ;)

I think that one might even be a new series in reply to what was already a second series on top of what I was *actually* trying to do when I first started shaving this yak. Or maybe what I was originally trying to implement has already been lost in the noise :)
Re: [PATCH] powerpc/xive: Change IRQ domain to a tree domain
On 11/16/21 17:58, Marc Zyngier wrote:
> On Tue, 16 Nov 2021 13:40:22 +, Cédric Le Goater wrote:
>> Commit 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains
>> exclusive") introduced an IRQ_DOMAIN_FLAG_NO_MAP flag to isolate the
>> 'nomap' domains still in use under the powerpc arch. With this new
>> flag, the revmap_tree of the IRQ domain is not used anymore. This
>> change broke the support of shared LSIs [1] in the XIVE driver because
>> it was relying on a lookup in the revmap_tree to query previously
>> mapped interrupts.
>
> Just a lookup? Surely there is more to it, no?

nope. The HW IRQ for the INTx is defined in the DT. It is caught by of_irq_parse_and_map_pci() which simply adds an extra mapping on the same INTx since the previous one is not found.

Using an INTx is quite rare nowadays and a shared one is even more uncommon I guess; I could only reproduce this on the baremetal platform with the QEMU PowerNV machine using the same virtio devices.

Thanks,

C.

>> Linux now creates two distinct IRQ mappings on the same HW IRQ which
>> can lead to unexpected behavior in the drivers.
>>
>> The XIVE IRQ domain is not a direct mapping domain and its HW IRQ
>> interrupt number space is rather large : 1M/socket on POWER9 and
>> POWER10, change the XIVE driver to use a 'tree' domain type instead.
>>
>> [1] For instance, a linux KVM guest with virtio-rng and virtio-balloon
>> devices.
>>
>> Cc: Marc Zyngier
>> Cc: sta...@vger.kernel.org # v5.14+
>> Fixes: 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains exclusive")
>> Signed-off-by: Cédric Le Goater
>> ---
>>
>> Marc,
>>
>> The Fixes tag is there because the patch in question revealed that
>> something was broken in XIVE. genirq is not in cause. However, I
>> don't know for PS3 and Cell. May be less critical for now.
>
> Depends if they expect something that a no-map domain cannot provide.
>
>> arch/powerpc/sysdev/xive/common.c | 3 +--
>> arch/powerpc/sysdev/xive/Kconfig  | 1 -
>> 2 files changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
>> index fed6fd16c8f4..9d0f0fe25598 100644
>> --- a/arch/powerpc/sysdev/xive/common.c
>> +++ b/arch/powerpc/sysdev/xive/common.c
>> @@ -1536,8 +1536,7 @@ static const struct irq_domain_ops xive_irq_domain_ops = {
>>
>>  static void __init xive_init_host(struct device_node *np)
>>  {
>> -	xive_irq_domain = irq_domain_add_nomap(np, XIVE_MAX_IRQ,
>> -					       &xive_irq_domain_ops, NULL);
>> +	xive_irq_domain = irq_domain_add_tree(np, &xive_irq_domain_ops, NULL);
>>  	if (WARN_ON(xive_irq_domain == NULL))
>>  		return;
>>  	irq_set_default_host(xive_irq_domain);
>> diff --git a/arch/powerpc/sysdev/xive/Kconfig b/arch/powerpc/sysdev/xive/Kconfig
>> index 97796c6b63f0..785c292d104b 100644
>> --- a/arch/powerpc/sysdev/xive/Kconfig
>> +++ b/arch/powerpc/sysdev/xive/Kconfig
>> @@ -3,7 +3,6 @@ config PPC_XIVE
>>  	bool
>>  	select PPC_SMP_MUXED_IPI
>>  	select HARDIRQS_SW_RESEND
>> -	select IRQ_DOMAIN_NOMAP
>>
>>  config PPC_XIVE_NATIVE
>>  	bool
>
> As long as this works, I'm happy with one less no-map user.
>
> Acked-by: Marc Zyngier
>
> M.
Re: [PATCH] powerpc/xive: Change IRQ domain to a tree domain
On Tue, 16 Nov 2021 13:40:22 +, Cédric Le Goater wrote:
>
> Commit 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains
> exclusive") introduced an IRQ_DOMAIN_FLAG_NO_MAP flag to isolate the
> 'nomap' domains still in use under the powerpc arch. With this new
> flag, the revmap_tree of the IRQ domain is not used anymore. This
> change broke the support of shared LSIs [1] in the XIVE driver because
> it was relying on a lookup in the revmap_tree to query previously
> mapped interrupts.

Just a lookup? Surely there is more to it, no?

> Linux now creates two distinct IRQ mappings on the
> same HW IRQ which can lead to unexpected behavior in the drivers.
>
> The XIVE IRQ domain is not a direct mapping domain and its HW IRQ
> interrupt number space is rather large : 1M/socket on POWER9 and
> POWER10, change the XIVE driver to use a 'tree' domain type instead.
>
> [1] For instance, a linux KVM guest with virtio-rng and virtio-balloon
> devices.
>
> Cc: Marc Zyngier
> Cc: sta...@vger.kernel.org # v5.14+
> Fixes: 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains exclusive")
> Signed-off-by: Cédric Le Goater
> ---
>
> Marc,
>
> The Fixes tag is there because the patch in question revealed that
> something was broken in XIVE. genirq is not in cause. However, I
> don't know for PS3 and Cell. May be less critical for now.

Depends if they expect something that a no-map domain cannot provide.
>
> arch/powerpc/sysdev/xive/common.c | 3 +--
> arch/powerpc/sysdev/xive/Kconfig  | 1 -
> 2 files changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
> index fed6fd16c8f4..9d0f0fe25598 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -1536,8 +1536,7 @@ static const struct irq_domain_ops xive_irq_domain_ops = {
>
>  static void __init xive_init_host(struct device_node *np)
>  {
> -	xive_irq_domain = irq_domain_add_nomap(np, XIVE_MAX_IRQ,
> -					       &xive_irq_domain_ops, NULL);
> +	xive_irq_domain = irq_domain_add_tree(np, &xive_irq_domain_ops, NULL);
>  	if (WARN_ON(xive_irq_domain == NULL))
>  		return;
>  	irq_set_default_host(xive_irq_domain);
> diff --git a/arch/powerpc/sysdev/xive/Kconfig b/arch/powerpc/sysdev/xive/Kconfig
> index 97796c6b63f0..785c292d104b 100644
> --- a/arch/powerpc/sysdev/xive/Kconfig
> +++ b/arch/powerpc/sysdev/xive/Kconfig
> @@ -3,7 +3,6 @@ config PPC_XIVE
>  	bool
>  	select PPC_SMP_MUXED_IPI
>  	select HARDIRQS_SW_RESEND
> -	select IRQ_DOMAIN_NOMAP
>
>  config PPC_XIVE_NATIVE
>  	bool

As long as this works, I'm happy with one less no-map user.

Acked-by: Marc Zyngier

M.

--
Without deviation from the norm, progress is not possible.
Re: [PATCH 4/5] KVM: x86: Use kvm_get_vcpu() instead of open-coded access
On 11/16/21 17:07, Sean Christopherson wrote:
>>> 	if (!kvm_arch_has_assigned_device(kvm) ||
>>> 	    !irq_remapping_cap(IRQ_POSTING_CAP) ||
>>> -	    !kvm_vcpu_apicv_active(kvm->vcpus[0]))
>>> +	    !irqchip_in_kernel(kvm) || !enable_apicv)
>>> 		return 0;
>>>
>>> 	idx = srcu_read_lock(&kvm->irq_srcu);
>>
>> What happens then if pi_pre_block is called and the IRTE denotes a
>> posted interrupt? I might be wrong, but it seems to me that you have to
>> change all of the occurrences this way. As soon as enable_apicv is set,
>> you need to go through the POSTED_INTR_WAKEUP_VECTOR just in case.
>
> Sorry, I didn't grok that at all. All occurrences of what?

Of the !assigned-device || !VTd-PI || !kvm_vcpu_apicv_active(vcpu) checks. This way, CPUs are woken up correctly even if you have !kvm_vcpu_apicv_active(vcpu) but the IRTE is a posted-interrupt one.

Paolo
[PATCH v2 7/7] KVM: Convert kvm_for_each_vcpu() to using xa_for_each_range()
Now that the vcpu array is backed by an xarray, use the optimised iterator that matches the underlying data structure.

Suggested-by: Sean Christopherson
Signed-off-by: Marc Zyngier
---
 include/linux/kvm_host.h | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index dc635fbfd901..d20e33378c63 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -705,11 +705,9 @@ static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
 	return xa_load(&kvm->vcpu_array, i);
 }
 
-#define kvm_for_each_vcpu(idx, vcpup, kvm) \
-	for (idx = 0; \
-	     idx < atomic_read(&kvm->online_vcpus) && \
-	     (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
-	     idx++)
+#define kvm_for_each_vcpu(idx, vcpup, kvm)			\
+	xa_for_each_range(&kvm->vcpu_array, idx, vcpup, 0,	\
+			  (atomic_read(&kvm->online_vcpus) - 1))
 
 static inline struct kvm_vcpu *kvm_get_vcpu_by_id(struct kvm *kvm, int id)
 {
-- 
2.30.2
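One subtlety of this conversion, sketched below in plain C with an ordinary pointer array standing in for the xarray (illustrative only, not kernel code): the old macro stopped at the first NULL slot, while an xa_for_each_range()-style walk skips empty slots and keeps going. With KVM's densely packed vcpu indices the two agree, but the iteration semantics differ on sparse input:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SLOTS 8

/* Old-style loop: walk indices 0..online-1 and bail at the first NULL,
 * mirroring the kvm_for_each_vcpu() definition being replaced. */
static int count_old(void *slots[], int online)
{
	int n = 0;

	for (int i = 0; i < online && slots[i] != NULL; i++)
		n++;
	return n;
}

/* xa_for_each_range()-style walk over [0, last]: visit only indices
 * that hold an entry, skipping empty slots instead of stopping. */
static int count_xa(void *slots[], int last)
{
	int n = 0;

	for (int i = 0; i <= last; i++)
		if (slots[i] != NULL)
			n++;
	return n;
}
```

On a dense array the two return the same count, which is why the conversion is safe for the vcpu array; the sparse case only shows how the iterators differ.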
[PATCH v2 6/7] KVM: Use 'unsigned long' as kvm_for_each_vcpu()'s index
Everywhere we use kvm_for_each_vcpu(), we use an int as the vcpu index. Unfortunately, we're about to rework the iterator, which requires this to be upgraded to an unsigned long. Let's bite the bullet and repaint all of it in one go.

Signed-off-by: Marc Zyngier
---
 arch/arm64/kvm/arch_timer.c           |  8
 arch/arm64/kvm/arm.c                  |  6 +++---
 arch/arm64/kvm/pmu-emul.c             |  2 +-
 arch/arm64/kvm/psci.c                 |  6 +++---
 arch/arm64/kvm/reset.c                |  2 +-
 arch/arm64/kvm/vgic/vgic-init.c       | 10 ++
 arch/arm64/kvm/vgic/vgic-kvm-device.c |  2 +-
 arch/arm64/kvm/vgic/vgic-mmio-v2.c    |  3 +--
 arch/arm64/kvm/vgic/vgic-mmio-v3.c    |  7 ---
 arch/arm64/kvm/vgic/vgic-v3.c         |  4 ++--
 arch/arm64/kvm/vgic/vgic-v4.c         |  5 +++--
 arch/arm64/kvm/vgic/vgic.c            |  2 +-
 arch/powerpc/kvm/book3s_32_mmu.c      |  2 +-
 arch/powerpc/kvm/book3s_64_mmu.c      |  2 +-
 arch/powerpc/kvm/book3s_hv.c          |  8
 arch/powerpc/kvm/book3s_pr.c          |  2 +-
 arch/powerpc/kvm/book3s_xics.c        |  6 +++---
 arch/powerpc/kvm/book3s_xics.h        |  2 +-
 arch/powerpc/kvm/book3s_xive.c        | 15 +--
 arch/powerpc/kvm/book3s_xive.h        |  4 ++--
 arch/powerpc/kvm/book3s_xive_native.c |  8
 arch/powerpc/kvm/e500_emulate.c       |  2 +-
 arch/riscv/kvm/vcpu_sbi.c             |  2 +-
 arch/riscv/kvm/vmid.c                 |  2 +-
 arch/s390/kvm/interrupt.c             |  2 +-
 arch/s390/kvm/kvm-s390.c              | 21 +++--
 arch/s390/kvm/kvm-s390.h              |  4 ++--
 arch/x86/kvm/hyperv.c                 |  7 ---
 arch/x86/kvm/i8254.c                  |  2 +-
 arch/x86/kvm/i8259.c                  |  5 +++--
 arch/x86/kvm/ioapic.c                 |  4 ++--
 arch/x86/kvm/irq_comm.c               |  7 ---
 arch/x86/kvm/kvm_onhyperv.c           |  3 ++-
 arch/x86/kvm/lapic.c                  |  6 +++---
 arch/x86/kvm/svm/avic.c               |  2 +-
 arch/x86/kvm/svm/sev.c                |  3 ++-
 arch/x86/kvm/x86.c                    | 21 +++--
 include/linux/kvm_host.h              |  2 +-
 virt/kvm/kvm_main.c                   | 13 +++--
 39 files changed, 114 insertions(+), 100 deletions(-)

diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
index 3df67c127489..d6f4114f1d11 100644
--- a/arch/arm64/kvm/arch_timer.c
+++ b/arch/arm64/kvm/arch_timer.c
@@ -750,7 +750,7 @@ int kvm_timer_vcpu_reset(struct kvm_vcpu *vcpu)
 /* Make the updates of cntvoff for all vtimer contexts atomic */
 static void update_vtimer_cntvoff(struct kvm_vcpu *vcpu, u64 cntvoff)
 {
-	int i;
+	unsigned long i;
 	struct kvm *kvm = vcpu->kvm;
 	struct kvm_vcpu *tmp;
@@ -1189,8 +1189,8 @@ void kvm_timer_vcpu_terminate(struct kvm_vcpu *vcpu)
 static bool timer_irqs_are_valid(struct kvm_vcpu *vcpu)
 {
-	int vtimer_irq, ptimer_irq;
-	int i, ret;
+	int vtimer_irq, ptimer_irq, ret;
+	unsigned long i;
 
 	vtimer_irq = vcpu_vtimer(vcpu)->irq.irq;
 	ret = kvm_vgic_set_owner(vcpu, vtimer_irq, vcpu_vtimer(vcpu));
@@ -1297,7 +1297,7 @@ void kvm_timer_init_vhe(void)
 static void set_timer_irqs(struct kvm *kvm, int vtimer_irq, int ptimer_irq)
 {
 	struct kvm_vcpu *vcpu;
-	int i;
+	unsigned long i;
 
 	kvm_for_each_vcpu(i, vcpu, kvm) {
 		vcpu_vtimer(vcpu)->irq.irq = vtimer_irq;
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 29942d8c357c..6177b0c882ea 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -624,7 +624,7 @@ bool kvm_arch_intc_initialized(struct kvm *kvm)
 void kvm_arm_halt_guest(struct kvm *kvm)
 {
-	int i;
+	unsigned long i;
 	struct kvm_vcpu *vcpu;
 
 	kvm_for_each_vcpu(i, vcpu, kvm)
@@ -634,7 +634,7 @@ void kvm_arm_halt_guest(struct kvm *kvm)
 void kvm_arm_resume_guest(struct kvm *kvm)
 {
-	int i;
+	unsigned long i;
 	struct kvm_vcpu *vcpu;
 
 	kvm_for_each_vcpu(i, vcpu, kvm) {
@@ -2020,7 +2020,7 @@ static int finalize_hyp_mode(void)
 struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, unsigned long mpidr)
 {
 	struct kvm_vcpu *vcpu;
-	int i;
+	unsigned long i;
 
 	mpidr &= MPIDR_HWID_BITMASK;
 	kvm_for_each_vcpu(i, vcpu, kvm) {
diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
index a5e4bbf5e68f..0404357705a8 100644
--- a/arch/arm64/kvm/pmu-emul.c
+++ b/arch/arm64/kvm/pmu-emul.c
@@ -900,7 +900,7 @@ static int kvm_arm_pmu_v3_init(struct kvm_vcpu *vcpu)
  */
 static bool pmu_irq_is_valid(struct kvm *kvm, int irq)
 {
-	int i;
+	unsigned long i;
 	struct kvm_vcpu *vcpu;
 
 	kvm_for_each_vcpu(i, vcpu, kvm) {
diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c
index 74c47d420253..ed675fce8fb7 100644
--- a/arch/arm64/kvm/psci.c
+++ b/arch/arm64/kvm/psci.c
@@ -121,8 +121,8 @@ static unsigned long
[PATCH v2 5/7] KVM: Convert the kvm->vcpus array to a xarray
At least on arm64 and x86, the vcpus array is pretty huge (up to 1024 entries on x86) and is mostly empty in the majority of the cases (running 1k vcpu VMs is not that common). This means that we end up with a 4kB block of unused memory in the middle of the kvm structure.

Instead of wasting away this memory, let's use an xarray instead, which gives us almost the same flexibility as a normal array, but with a reduced memory usage with smaller VMs.

Signed-off-by: Marc Zyngier
---
 include/linux/kvm_host.h |  5 +++--
 virt/kvm/kvm_main.c      | 15 +--
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9a8c997ff9bc..df1b2b183d94 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -29,6 +29,7 @@
 #include
 #include
 #include
+#include <linux/xarray.h>
 #include
 #include
@@ -552,7 +553,7 @@ struct kvm {
 	struct mutex slots_arch_lock;
 	struct mm_struct *mm; /* userspace tied to this vm */
 	struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
-	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
+	struct xarray vcpu_array;
 
 	/* Used to wait for completion of MMU notifiers. */
 	spinlock_t mn_invalidate_lock;
@@ -701,7 +702,7 @@ static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
 	/* Pairs with smp_wmb() in kvm_vm_ioctl_create_vcpu. */
 	smp_rmb();
-	return kvm->vcpus[i];
+	return xa_load(&kvm->vcpu_array, i);
 }
 
 #define kvm_for_each_vcpu(idx, vcpup, kvm) \
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 73811c3829a2..9a7bce944427 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -458,7 +458,7 @@ void kvm_destroy_vcpus(struct kvm *kvm)
 	kvm_for_each_vcpu(i, vcpu, kvm) {
 		kvm_vcpu_destroy(vcpu);
-		kvm->vcpus[i] = NULL;
+		xa_erase(&kvm->vcpu_array, i);
 	}
 
 	atomic_set(&kvm->online_vcpus, 0);
@@ -1063,6 +1063,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	mutex_init(&kvm->slots_arch_lock);
 	spin_lock_init(&kvm->mn_invalidate_lock);
 	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
+	xa_init(&kvm->vcpu_array);
 
 	INIT_LIST_HEAD(&kvm->devices);
@@ -3658,7 +3659,10 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 	}
 
 	vcpu->vcpu_idx = atomic_read(&kvm->online_vcpus);
-	BUG_ON(kvm->vcpus[vcpu->vcpu_idx]);
+	r = xa_insert(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, GFP_KERNEL_ACCOUNT);
+	BUG_ON(r == -EBUSY);
+	if (r)
+		goto unlock_vcpu_destroy;
 
 	/* Fill the stats id string for the vcpu */
 	snprintf(vcpu->stats_id, sizeof(vcpu->stats_id), "kvm-%d/vcpu-%d",
@@ -3668,15 +3672,14 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 	kvm_get_kvm(kvm);
 	r = create_vcpu_fd(vcpu);
 	if (r < 0) {
+		xa_erase(&kvm->vcpu_array, vcpu->vcpu_idx);
 		kvm_put_kvm_no_destroy(kvm);
 		goto unlock_vcpu_destroy;
 	}
 
-	kvm->vcpus[vcpu->vcpu_idx] = vcpu;
-
 	/*
-	 * Pairs with smp_rmb() in kvm_get_vcpu. Write kvm->vcpus
-	 * before kvm->online_vcpu's incremented value.
+	 * Pairs with smp_rmb() in kvm_get_vcpu. Store the vcpu
+	 * pointer before kvm->online_vcpu's incremented value.
 	 */
 	smp_wmb();
 	atomic_inc(&kvm->online_vcpus);
-- 
2.30.2
[PATCH v2 4/7] KVM: x86: Use kvm_get_vcpu() instead of open-coded access
As we are about to change the way vcpus are allocated, mandate the use of kvm_get_vcpu() instead of open-coding the access.

Signed-off-by: Marc Zyngier
---
 arch/x86/kvm/vmx/posted_intr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 5f81ef092bd4..82a49720727d 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -272,7 +272,7 @@ int pi_update_irte(struct kvm *kvm, unsigned int host_irq, uint32_t guest_irq,
 	if (!kvm_arch_has_assigned_device(kvm) ||
 	    !irq_remapping_cap(IRQ_POSTING_CAP) ||
-	    !kvm_vcpu_apicv_active(kvm->vcpus[0]))
+	    !kvm_vcpu_apicv_active(kvm_get_vcpu(kvm, 0)))
 		return 0;
 
 	idx = srcu_read_lock(&kvm->irq_srcu);
-- 
2.30.2
[PATCH v2 3/7] KVM: s390: Use kvm_get_vcpu() instead of open-coded access
As we are about to change the way vcpus are allocated, mandate the use of kvm_get_vcpu() instead of open-coding the access.

Reviewed-by: Claudio Imbrenda
Signed-off-by: Marc Zyngier
---
 arch/s390/kvm/kvm-s390.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 7af53b8788fa..4a0f62b03964 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -4572,7 +4572,7 @@ int kvm_s390_vcpu_start(struct kvm_vcpu *vcpu)
 	}
 
 	for (i = 0; i < online_vcpus; i++) {
-		if (!is_vcpu_stopped(vcpu->kvm->vcpus[i]))
+		if (!is_vcpu_stopped(kvm_get_vcpu(vcpu->kvm, i)))
 			started_vcpus++;
 	}
@@ -4634,9 +4634,11 @@ int kvm_s390_vcpu_stop(struct kvm_vcpu *vcpu)
 	__disable_ibs_on_vcpu(vcpu);
 
 	for (i = 0; i < online_vcpus; i++) {
-		if (!is_vcpu_stopped(vcpu->kvm->vcpus[i])) {
+		struct kvm_vcpu *tmp = kvm_get_vcpu(vcpu->kvm, i);
+
+		if (!is_vcpu_stopped(tmp)) {
 			started_vcpus++;
-			started_vcpu = vcpu->kvm->vcpus[i];
+			started_vcpu = tmp;
 		}
 	}
-- 
2.30.2
[PATCH v2 2/7] KVM: mips: Use kvm_get_vcpu() instead of open-coded access
As we are about to change the way vcpus are allocated, mandate the use of kvm_get_vcpu() instead of open-coding the access.

Reviewed-by: Philippe Mathieu-Daudé
Signed-off-by: Marc Zyngier
---
 arch/mips/kvm/loongson_ipi.c | 4 ++--
 arch/mips/kvm/mips.c         | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/mips/kvm/loongson_ipi.c b/arch/mips/kvm/loongson_ipi.c
index 3681fc8fba38..5d53f32d837c 100644
--- a/arch/mips/kvm/loongson_ipi.c
+++ b/arch/mips/kvm/loongson_ipi.c
@@ -120,7 +120,7 @@ static int loongson_vipi_write(struct loongson_kvm_ipi *ipi,
 		s->status |= data;
 		irq.cpu = id;
 		irq.irq = 6;
-		kvm_vcpu_ioctl_interrupt(kvm->vcpus[id], &irq);
+		kvm_vcpu_ioctl_interrupt(kvm_get_vcpu(kvm, id), &irq);
 		break;
 
 	case CORE0_CLEAR_OFF:
@@ -128,7 +128,7 @@ static int loongson_vipi_write(struct loongson_kvm_ipi *ipi,
 		if (!s->status) {
 			irq.cpu = id;
 			irq.irq = -6;
-			kvm_vcpu_ioctl_interrupt(kvm->vcpus[id], &irq);
+			kvm_vcpu_ioctl_interrupt(kvm_get_vcpu(kvm, id), &irq);
 		}
 		break;
 
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index ceacca74f808..6228bf396d63 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -479,7 +479,7 @@ int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,
 	if (irq->cpu == -1)
 		dvcpu = vcpu;
 	else
-		dvcpu = vcpu->kvm->vcpus[irq->cpu];
+		dvcpu = kvm_get_vcpu(vcpu->kvm, irq->cpu);
 
 	if (intr == 2 || intr == 3 || intr == 4 || intr == 6) {
 		kvm_mips_callbacks->queue_io_int(dvcpu, irq);
-- 
2.30.2
[PATCH v2 1/7] KVM: Move wiping of the kvm->vcpus array to common code
All architectures have similar loops iterating over the vcpus, freeing one vcpu at a time, and eventually wiping the reference off the vcpus array. They are also inconsistently taking the kvm->lock mutex when wiping the references from the array. Make this code common, which will simplify further changes. The locking is dropped altogether, as this should only be called when there are no further references on the kvm structure. Reviewed-by: Claudio Imbrenda Signed-off-by: Marc Zyngier --- arch/arm64/kvm/arm.c | 10 +- arch/mips/kvm/mips.c | 21 + arch/powerpc/kvm/powerpc.c | 10 +- arch/riscv/kvm/vm.c | 10 +- arch/s390/kvm/kvm-s390.c | 18 +- arch/x86/kvm/x86.c | 9 + include/linux/kvm_host.h | 2 +- virt/kvm/kvm_main.c | 17 +++-- 8 files changed, 22 insertions(+), 75 deletions(-) diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 2f03cbfefe67..29942d8c357c 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -175,19 +175,11 @@ vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf) */ void kvm_arch_destroy_vm(struct kvm *kvm) { - int i; - bitmap_free(kvm->arch.pmu_filter); kvm_vgic_destroy(kvm); - for (i = 0; i < KVM_MAX_VCPUS; ++i) { - if (kvm->vcpus[i]) { - kvm_vcpu_destroy(kvm->vcpus[i]); - kvm->vcpus[i] = NULL; - } - } - atomic_set(&kvm->online_vcpus, 0); + kvm_destroy_vcpus(kvm); } int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c index 562aa878b266..ceacca74f808 100644 --- a/arch/mips/kvm/mips.c +++ b/arch/mips/kvm/mips.c @@ -171,25 +171,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) return 0; } -void kvm_mips_free_vcpus(struct kvm *kvm) -{ - unsigned int i; - struct kvm_vcpu *vcpu; - - kvm_for_each_vcpu(i, vcpu, kvm) { - kvm_vcpu_destroy(vcpu); - } - - mutex_lock(&kvm->lock); - - for (i = 0; i < atomic_read(&kvm->online_vcpus); i++) - kvm->vcpus[i] = NULL; - - atomic_set(&kvm->online_vcpus, 0); - - mutex_unlock(&kvm->lock); -} - static void kvm_mips_free_gpa_pt(struct 
kvm *kvm) { /* It should always be safe to remove after flushing the whole range */ @@ -199,7 +180,7 @@ static void kvm_mips_free_gpa_pt(struct kvm *kvm) void kvm_arch_destroy_vm(struct kvm *kvm) { - kvm_mips_free_vcpus(kvm); + kvm_destroy_vcpus(kvm); kvm_mips_free_gpa_pt(kvm); } diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 35e9cccdeef9..492e4a4121cb 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -463,9 +463,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) void kvm_arch_destroy_vm(struct kvm *kvm) { - unsigned int i; - struct kvm_vcpu *vcpu; - #ifdef CONFIG_KVM_XICS /* * We call kick_all_cpus_sync() to ensure that all @@ -476,14 +473,9 @@ void kvm_arch_destroy_vm(struct kvm *kvm) kick_all_cpus_sync(); #endif - kvm_for_each_vcpu(i, vcpu, kvm) - kvm_vcpu_destroy(vcpu); + kvm_destroy_vcpus(kvm); mutex_lock(&kvm->lock); - for (i = 0; i < atomic_read(&kvm->online_vcpus); i++) - kvm->vcpus[i] = NULL; - - atomic_set(&kvm->online_vcpus, 0); kvmppc_core_destroy_vm(kvm); diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c index 26399df15b63..6af6cde295eb 100644 --- a/arch/riscv/kvm/vm.c +++ b/arch/riscv/kvm/vm.c @@ -46,15 +46,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) void kvm_arch_destroy_vm(struct kvm *kvm) { - int i; - - for (i = 0; i < KVM_MAX_VCPUS; ++i) { - if (kvm->vcpus[i]) { - kvm_vcpu_destroy(kvm->vcpus[i]); - kvm->vcpus[i] = NULL; - } - } - atomic_set(&kvm->online_vcpus, 0); + kvm_destroy_vcpus(kvm); } int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c index c6257f625929..7af53b8788fa 100644 --- a/arch/s390/kvm/kvm-s390.c +++ b/arch/s390/kvm/kvm-s390.c @@ -2819,27 +2819,11 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu) free_page((unsigned long)(vcpu->arch.sie_block)); } -static void kvm_free_vcpus(struct kvm *kvm) -{ - unsigned int i; - struct kvm_vcpu *vcpu; - - kvm_for_each_vcpu(i, vcpu, kvm) - 
kvm_vcpu_destroy(vcpu); - - mutex_lock(&kvm->lock); - for (i = 0; i < atomic_read(&kvm->online_vcpus); i++) - kvm->vcpus[i] = NULL; - - atomic_set(&kvm->online_vcpus, 0); - mutex_unlock(&kvm->lock); -} - void kvm_arch_destroy_vm(struct kvm *kvm) { u16 rc, rrc; - kvm_free_vcpus(kvm); +
[PATCH v2 0/7] KVM: Turn the vcpu array into an xarray
The kvm structure is pretty large. A large portion of it is the vcpu array, which is 4kB on arm64 with 512 vcpu, double that on x86-64. Of course, hardly anyone runs VMs this big, so this is often a net waste of memory and cache locality. A possible approach is to turn the fixed-size array into an xarray, which results in a net code deletion after a bit of cleanup. * From v1: - Rebased on v5.16-rc1 - Dropped the dubious locking on teardown - Converted kvm_for_each_vcpu() to xa_for_each_range(), together with an invasive change converting the index to an unsigned long Marc Zyngier (7): KVM: Move wiping of the kvm->vcpus array to common code KVM: mips: Use kvm_get_vcpu() instead of open-coded access KVM: s390: Use kvm_get_vcpu() instead of open-coded access KVM: x86: Use kvm_get_vcpu() instead of open-coded access KVM: Convert the kvm->vcpus array to a xarray KVM: Use 'unsigned long' as kvm_for_each_vcpu()'s index KVM: Convert kvm_for_each_vcpu() to using xa_for_each_range() arch/arm64/kvm/arch_timer.c | 8 ++--- arch/arm64/kvm/arm.c | 16 +++-- arch/arm64/kvm/pmu-emul.c | 2 +- arch/arm64/kvm/psci.c | 6 ++-- arch/arm64/kvm/reset.c| 2 +- arch/arm64/kvm/vgic/vgic-init.c | 10 +++--- arch/arm64/kvm/vgic/vgic-kvm-device.c | 2 +- arch/arm64/kvm/vgic/vgic-mmio-v2.c| 3 +- arch/arm64/kvm/vgic/vgic-mmio-v3.c| 7 ++-- arch/arm64/kvm/vgic/vgic-v3.c | 4 +-- arch/arm64/kvm/vgic/vgic-v4.c | 5 +-- arch/arm64/kvm/vgic/vgic.c| 2 +- arch/mips/kvm/loongson_ipi.c | 4 +-- arch/mips/kvm/mips.c | 23 ++--- arch/powerpc/kvm/book3s_32_mmu.c | 2 +- arch/powerpc/kvm/book3s_64_mmu.c | 2 +- arch/powerpc/kvm/book3s_hv.c | 8 ++--- arch/powerpc/kvm/book3s_pr.c | 2 +- arch/powerpc/kvm/book3s_xics.c| 6 ++-- arch/powerpc/kvm/book3s_xics.h| 2 +- arch/powerpc/kvm/book3s_xive.c| 15 + arch/powerpc/kvm/book3s_xive.h| 4 +-- arch/powerpc/kvm/book3s_xive_native.c | 8 ++--- arch/powerpc/kvm/e500_emulate.c | 2 +- arch/powerpc/kvm/powerpc.c| 10 +- arch/riscv/kvm/vcpu_sbi.c | 2 +- arch/riscv/kvm/vm.c | 10 +- 
arch/riscv/kvm/vmid.c | 2 +- arch/s390/kvm/interrupt.c | 2 +- arch/s390/kvm/kvm-s390.c | 47 ++- arch/s390/kvm/kvm-s390.h | 4 +-- arch/x86/kvm/hyperv.c | 7 ++-- arch/x86/kvm/i8254.c | 2 +- arch/x86/kvm/i8259.c | 5 +-- arch/x86/kvm/ioapic.c | 4 +-- arch/x86/kvm/irq_comm.c | 7 ++-- arch/x86/kvm/kvm_onhyperv.c | 3 +- arch/x86/kvm/lapic.c | 6 ++-- arch/x86/kvm/svm/avic.c | 2 +- arch/x86/kvm/svm/sev.c| 3 +- arch/x86/kvm/vmx/posted_intr.c| 2 +- arch/x86/kvm/x86.c| 30 +++-- include/linux/kvm_host.h | 17 +- virt/kvm/kvm_main.c | 41 --- 44 files changed, 158 insertions(+), 193 deletions(-) -- 2.30.2
Re: [PATCH 0/5] KVM: Turn the vcpu array into an xarray
On Tue, 16 Nov 2021 15:03:40 +, Paolo Bonzini wrote: > > On 11/5/21 20:20, Marc Zyngier wrote: > > The kvm structure is pretty large. A large portion of it is the vcpu > > array, which is 4kB on x86_64 and arm64 as they deal with 512 vcpu > > VMs. Of course, hardly anyone runs VMs this big, so this is often a > > net waste of memory and cache locality. > > > > A possible approach is to turn the fixed-size array into an xarray, > > which results in a net code deletion after a bit of cleanup. > > > > This series is on top of the current linux/master as it touches the > > RISC-V implementation. Only tested on arm64. > > Queued, only locally until I get a review for my replacement of patch > 4 (see > https://lore.kernel.org/kvm/2026142205.719375-1-pbonz...@redhat.com/T/). In which case, let me send a v2 with the changes that we discussed with Sean. It will still have my version of patch 4, but that's nothing you can't fix. M. -- Without deviation from the norm, progress is not possible.
Re: [PATCH] powerpc: clean vdso32 and vdso64 directories
Hi Masahiro, Le 09/11/2021 à 19:50, Masahiro Yamada a écrit : Since commit bce74491c300 ("powerpc/vdso: fix unnecessary rebuilds of vgettimeofday.o"), "make ARCH=powerpc clean" does not clean up the arch/powerpc/kernel/{vdso32,vdso64} directories. Use the subdir- trick to let "make clean" descend into them. Fixes: bce74491c300 ("powerpc/vdso: fix unnecessary rebuilds of vgettimeofday.o") Signed-off-by: Masahiro Yamada --- arch/powerpc/kernel/Makefile | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile index 0e3640e14eb1..5fa68c2ef1f8 100644 --- a/arch/powerpc/kernel/Makefile +++ b/arch/powerpc/kernel/Makefile @@ -196,3 +196,6 @@ clean-files := vmlinux.lds # Force dependency (incbin is bad) $(obj)/vdso32_wrapper.o : $(obj)/vdso32/vdso32.so.dbg $(obj)/vdso64_wrapper.o : $(obj)/vdso64/vdso64.so.dbg + +# for cleaning +subdir- += vdso32 vdso64 This patch makes me think about one thing I would have liked to do, but I don't know Makefiles well enough to be able to do it. You could probably help me with it. vdso32 and vdso64 contain a lot of redundant sources. I would like to merge them into a new single directory, let's say 'vdso', and use the files in that directory to build both vdso32.so and vdso64.so. I have a feeling that x86 is doing it that way, but I've not been able to figure out how to build two objects using the same C/S files. Thanks Christophe
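[Editor's note] The question above — how to build both vdso32.so and vdso64.so from one shared source directory — can be sketched in kbuild terms. The fragment below is purely hypothetical (it is NOT the real powerpc or x86 Makefile; cmd_vdso32cc/cmd_vdso64cc are invented names for commands that would wrap $(CC) with the right ABI flags): the idea, loosely modelled on x86's arch/x86/entry/vdso/Makefile, is one source list compiled twice into objects with distinct suffixes.

```make
# Hypothetical shared-vdso Makefile sketch -- not from any real tree.
src-vdso := vgettimeofday.c datapage.c

obj-vdso32 := $(src-vdso:.c=-32.o)
obj-vdso64 := $(src-vdso:.c=-64.o)

# Per-flavour pattern rules pick the ABI; cmd_vdso32cc and cmd_vdso64cc
# are assumed helpers, e.g. $(CC) -m32 / -m64 plus the usual vdso flags.
$(obj)/%-32.o: $(src)/%.c FORCE
	$(call if_changed_dep,vdso32cc)

$(obj)/%-64.o: $(src)/%.c FORCE
	$(call if_changed_dep,vdso64cc)

# Each .so links only its own flavour of objects.
$(obj)/vdso32.so.dbg: $(addprefix $(obj)/,$(obj-vdso32))
$(obj)/vdso64.so.dbg: $(addprefix $(obj)/,$(obj-vdso64))
```

The key trick is that the two object flavours have different names, so kbuild never tries to build one object file two ways.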
Re: [PATCH v2 12/13] lkdtm: Fix execute_[user]_location()
Hi Kees, Le 16/10/2021 à 08:42, Christophe Leroy a écrit : Le 15/10/2021 à 23:31, Kees Cook a écrit : On Thu, Oct 14, 2021 at 07:50:01AM +0200, Christophe Leroy wrote: execute_location() and execute_user_location() intend to copy do_nothing() text and execute it at a new location. However, at the time being it doesn't copy the do_nothing() function but the do_nothing() function descriptor, which still points to the original text. So in the end it still executes do_nothing() at its original location, although using a copied function descriptor. So, fix that by really copying do_nothing() text and build a new function descriptor by copying do_nothing() function descriptor and updating the target address with the new location. Also fix the displayed addresses by dereferencing do_nothing() function descriptor. Signed-off-by: Christophe Leroy --- drivers/misc/lkdtm/perms.c | 25 + include/asm-generic/sections.h | 5 + 2 files changed, 26 insertions(+), 4 deletions(-) diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c index 5266dc28df6e..96b3ebfcb8ed 100644 --- a/drivers/misc/lkdtm/perms.c +++ b/drivers/misc/lkdtm/perms.c @@ -44,19 +44,32 @@ static noinline void do_overwritten(void) return; } +static void *setup_function_descriptor(func_desc_t *fdesc, void *dst) +{ + memcpy(fdesc, do_nothing, sizeof(*fdesc)); + fdesc->addr = (unsigned long)dst; + barrier(); + + return fdesc; +} How about collapsing the "have_function_descriptors()" check into setup_function_descriptor()? static void *setup_function_descriptor(func_desc_t *fdesc, void *dst) { if (__is_defined(HAVE_FUNCTION_DESCRIPTORS)) { memcpy(fdesc, do_nothing, sizeof(*fdesc)); fdesc->addr = (unsigned long)dst; barrier(); return fdesc; } else { return dst; } } Ok ... 
diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h index 76163883c6ff..d225318538bd 100644 --- a/include/asm-generic/sections.h +++ b/include/asm-generic/sections.h @@ -70,6 +70,11 @@ typedef struct { } func_desc_t; #endif +static inline bool have_function_descriptors(void) +{ + return __is_defined(HAVE_FUNCTION_DESCRIPTORS); +} + /* random extra sections (if any). Override * in asm/sections.h */ #ifndef arch_is_kernel_text This hunk seems like it should live in a separate patch. Ok I move it in a previous patch. Do you have any additional feedback or comment on series v3 ? What's the way forward, should it go via LKDTM tree or via powerpc tree or another tree ? I see there are neither Ack-by nor Reviewed-by for the last 2 patches. Thanks Christophe
Re: [PATCH 0/5] KVM: Turn the vcpu array into an xarray
On 11/5/21 20:20, Marc Zyngier wrote: The kvm structure is pretty large. A large portion of it is the vcpu array, which is 4kB on x86_64 and arm64 as they deal with 512 vcpu VMs. Of course, hardly anyone runs VMs this big, so this is often a net waste of memory and cache locality. A possible approach is to turn the fixed-size array into an xarray, which results in a net code deletion after a bit of cleanup. This series is on top of the current linux/master as it touches the RISC-V implementation. Only tested on arm64. Queued, only locally until I get a review for my replacement of patch 4 (see https://lore.kernel.org/kvm/2026142205.719375-1-pbonz...@redhat.com/T/). Paolo Marc Zyngier (5): KVM: Move wiping of the kvm->vcpus array to common code KVM: mips: Use kvm_get_vcpu() instead of open-coded access KVM: s390: Use kvm_get_vcpu() instead of open-coded access KVM: x86: Use kvm_get_vcpu() instead of open-coded access KVM: Convert the kvm->vcpus array to a xarray arch/arm64/kvm/arm.c | 10 +- arch/mips/kvm/loongson_ipi.c | 4 ++-- arch/mips/kvm/mips.c | 23 ++- arch/powerpc/kvm/powerpc.c | 10 +- arch/riscv/kvm/vm.c| 10 +- arch/s390/kvm/kvm-s390.c | 26 ++ arch/x86/kvm/vmx/posted_intr.c | 2 +- arch/x86/kvm/x86.c | 9 + include/linux/kvm_host.h | 7 --- virt/kvm/kvm_main.c| 33 ++--- 10 files changed, 45 insertions(+), 89 deletions(-)
Re: [RESEND PATCH v5 0/4] Add perf interface to expose nvdimm
Hi Le 16/11/2021 à 05:49, Kajol Jain a écrit : > Patchset adds performance stats reporting support for nvdimm. > Added interface includes support for pmu register/unregister > functions. A structure is added called nvdimm_pmu to be used for > adding arch/platform specific data such as cpumask, nvdimm device > pointer and pmu event functions like event_init/add/read/del. > User could use the standard perf tool to access perf events > exposed via pmu. > > Interface also defines supported event list, config fields for the > event attributes and their corresponding bit values which are exported > via sysfs. Patch 3 exposes IBM pseries platform nmem* device > performance stats using this interface. You resending your v5 series ? Is there any news since your submittion last month that's awaiting in patchwork here at https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=264422 ? Christophe > > Result from power9 pseries lpar with 2 nvdimm device: > > Ex: List all event by perf list > > command:# perf list nmem > >nmem0/cache_rh_cnt/[Kernel PMU event] >nmem0/cache_wh_cnt/[Kernel PMU event] >nmem0/cri_res_util/[Kernel PMU event] >nmem0/ctl_res_cnt/ [Kernel PMU event] >nmem0/ctl_res_tm/ [Kernel PMU event] >nmem0/fast_w_cnt/ [Kernel PMU event] >nmem0/host_l_cnt/ [Kernel PMU event] >nmem0/host_l_dur/ [Kernel PMU event] >nmem0/host_s_cnt/ [Kernel PMU event] >nmem0/host_s_dur/ [Kernel PMU event] >nmem0/med_r_cnt/ [Kernel PMU event] >nmem0/med_r_dur/ [Kernel PMU event] >nmem0/med_w_cnt/ [Kernel PMU event] >nmem0/med_w_dur/ [Kernel PMU event] >nmem0/mem_life/[Kernel PMU event] >nmem0/poweron_secs/[Kernel PMU event] >... >nmem1/mem_life/[Kernel PMU event] >nmem1/poweron_secs/[Kernel PMU event] > > Patch1: > Introduces the nvdimm_pmu structure > Patch2: > Adds common interface to add arch/platform specific data > includes nvdimm device pointer, pmu data along with > pmu event functions. 
It also defines supported event list > and adds attribute groups for format, events and cpumask. > It also adds code for cpu hotplug support. > Patch3: > Add code in arch/powerpc/platform/pseries/papr_scm.c to expose > nmem* pmu. It fills in the nvdimm_pmu structure with pmu name, > capabilities, cpumask and event functions and then registers > the pmu by adding callbacks to register_nvdimm_pmu. > Patch4: > Sysfs documentation patch > > Changelog > --- > v4 -> v5: > - Remove multiple variables defined in nvdimm_pmu structure include >name and pmu functions(event_int/add/del/read) as they are just >used to copy them again in pmu variable. Now we are directly doing >this step in arch specific code as suggested by Dan Williams. > > - Remove attribute group field from nvdimm pmu structure and >defined these attribute groups in common interface which >includes format, event list along with cpumask as suggested by >Dan Williams. >Since we added static defination for attrbute groups needed in >common interface, removes corresponding code from papr. > > - Add nvdimm pmu event list with event codes in the common interface. > > - Remove Acked-by/Reviewed-by/Tested-by tags as code is refactored >to handle review comments from Dan. > > - Make nvdimm_pmu_free_hotplug_memory function static as reported >by kernel test robot, also add corresponding Reported-by tag. > > - Link to the patchset v4: https://lkml.org/lkml/2021/9/3/45 > > v3 -> v4 > - Rebase code on top of current papr_scm code without any logical >changes. > > - Added Acked-by tag from Peter Zijlstra and Reviewed by tag >from Madhavan Srinivasan. > > - Link to the patchset v3: https://lkml.org/lkml/2021/6/17/605 > > v2 -> v3 > - Added Tested-by tag. > > - Fix nvdimm mailing list in the ABI Documentation. > > - Link to the patchset v2: https://lkml.org/lkml/2021/6/14/25 > > v1 -> v2 > - Fix hotplug code by adding pmu migration call >incase current designated cpu got offline. As >pointed by Peter Zijlstra. 
> > - Removed the retun -1 part from cpu hotplug offline >function. > > - Link to the patchset v1: https://lkml.org/lkml/2021/6/8/500 > > Kajol Jain (4): >drivers/nvdimm: Add nvdimm pmu structure >drivers/nvdimm: Add perf interface
Re: [PATCH] powerpc/xive: Change IRQ domain to a tree domain
On 11/16/21 15:23, Greg Kurz wrote: On Tue, 16 Nov 2021 14:40:22 +0100 Cédric Le Goater wrote: Commit 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains exclusive") introduced an IRQ_DOMAIN_FLAG_NO_MAP flag to isolate the 'nomap' domains still in use under the powerpc arch. With this new flag, the revmap_tree of the IRQ domain is not used anymore. This change broke the support of shared LSIs [1] in the XIVE driver because it was relying on a lookup in the revmap_tree to query previously mapped interrupts. Linux now creates two distinct IRQ mappings on the same HW IRQ which can lead to unexpected behavior in the drivers. The XIVE IRQ domain is not a direct mapping domain and its HW IRQ interrupt number space is rather large : 1M/socket on POWER9 and POWER10, change the XIVE driver to use a 'tree' domain type instead. [1] For instance, a linux KVM guest with virtio-rng and virtio-balloon devices. Cc: Marc Zyngier Cc: sta...@vger.kernel.org # v5.14+ Fixes: 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains exclusive") Signed-off-by: Cédric Le Goater --- Tested-by: Greg Kurz with a KVM guest + virtio-rng + virtio-balloon on a POWER9 host. Did you test on a 5.14 backport or mainline ? I am asking because a large change adding support for MSI domains to XIVE was merged in 5.15. Thanks, C. Marc, The Fixes tag is there because the patch in question revealed that something was broken in XIVE. genirq is not in cause. However, I don't know for PS3 and Cell. May be less critical for now. 
arch/powerpc/sysdev/xive/common.c | 3 +-- arch/powerpc/sysdev/xive/Kconfig | 1 - 2 files changed, 1 insertion(+), 3 deletions(-) diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c index fed6fd16c8f4..9d0f0fe25598 100644 --- a/arch/powerpc/sysdev/xive/common.c +++ b/arch/powerpc/sysdev/xive/common.c @@ -1536,8 +1536,7 @@ static const struct irq_domain_ops xive_irq_domain_ops = { static void __init xive_init_host(struct device_node *np) { - xive_irq_domain = irq_domain_add_nomap(np, XIVE_MAX_IRQ, - &xive_irq_domain_ops, NULL); + xive_irq_domain = irq_domain_add_tree(np, &xive_irq_domain_ops, NULL); if (WARN_ON(xive_irq_domain == NULL)) return; irq_set_default_host(xive_irq_domain); diff --git a/arch/powerpc/sysdev/xive/Kconfig b/arch/powerpc/sysdev/xive/Kconfig index 97796c6b63f0..785c292d104b 100644 --- a/arch/powerpc/sysdev/xive/Kconfig +++ b/arch/powerpc/sysdev/xive/Kconfig @@ -3,7 +3,6 @@ config PPC_XIVE bool select PPC_SMP_MUXED_IPI select HARDIRQS_SW_RESEND - select IRQ_DOMAIN_NOMAP config PPC_XIVE_NATIVE bool
Re: [PATCH] powerpc/xive: Change IRQ domain to a tree domain
On Tue, 16 Nov 2021 15:49:13 +0100 Cédric Le Goater wrote: > On 11/16/21 15:23, Greg Kurz wrote: > > On Tue, 16 Nov 2021 14:40:22 +0100 > > Cédric Le Goater wrote: > > > >> Commit 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains > >> exclusive") introduced an IRQ_DOMAIN_FLAG_NO_MAP flag to isolate the > >> 'nomap' domains still in use under the powerpc arch. With this new > >> flag, the revmap_tree of the IRQ domain is not used anymore. This > >> change broke the support of shared LSIs [1] in the XIVE driver because > >> it was relying on a lookup in the revmap_tree to query previously > >> mapped interrupts. Linux now creates two distinct IRQ mappings on the > >> same HW IRQ which can lead to unexpected behavior in the drivers. > >> > >> The XIVE IRQ domain is not a direct mapping domain and its HW IRQ > >> interrupt number space is rather large : 1M/socket on POWER9 and > >> POWER10, change the XIVE driver to use a 'tree' domain type instead. > >> > >> [1] For instance, a linux KVM guest with virtio-rng and virtio-balloon > >> devices. > >> > >> Cc: Marc Zyngier > >> Cc: sta...@vger.kernel.org # v5.14+ > >> Fixes: 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains > >> exclusive") > >> Signed-off-by: Cédric Le Goater > >> --- > >> > > > > Tested-by: Greg Kurz > > > > with a KVM guest + virtio-rng + virtio-balloon on a POWER9 host. > > Did you test on a 5.14 backport or mainline ? > I've tested on a 5.14 backport only. > I am asking because a large change adding support for MSI domains > to XIVE was merged in 5.15. > > Thanks, > > C. > > > > > >> Marc, > >> > >> The Fixes tag is there because the patch in question revealed that > >> something was broken in XIVE. genirq is not in cause. However, I > >> don't know for PS3 and Cell. May be less critical for now. 
> >> arch/powerpc/sysdev/xive/common.c | 3 +-- > >> arch/powerpc/sysdev/xive/Kconfig | 1 - > >> 2 files changed, 1 insertion(+), 3 deletions(-) > >> > >> diff --git a/arch/powerpc/sysdev/xive/common.c > >> b/arch/powerpc/sysdev/xive/common.c > >> index fed6fd16c8f4..9d0f0fe25598 100644 > >> --- a/arch/powerpc/sysdev/xive/common.c > >> +++ b/arch/powerpc/sysdev/xive/common.c > >> @@ -1536,8 +1536,7 @@ static const struct irq_domain_ops > >> xive_irq_domain_ops = { > >> > >> static void __init xive_init_host(struct device_node *np) > >> { > >> - xive_irq_domain = irq_domain_add_nomap(np, XIVE_MAX_IRQ, > >> - &xive_irq_domain_ops, NULL); > >> + xive_irq_domain = irq_domain_add_tree(np, &xive_irq_domain_ops, NULL); > >>if (WARN_ON(xive_irq_domain == NULL)) > >>return; > >>irq_set_default_host(xive_irq_domain); > >> diff --git a/arch/powerpc/sysdev/xive/Kconfig > >> b/arch/powerpc/sysdev/xive/Kconfig > >> index 97796c6b63f0..785c292d104b 100644 > >> --- a/arch/powerpc/sysdev/xive/Kconfig > >> +++ b/arch/powerpc/sysdev/xive/Kconfig > >> @@ -3,7 +3,6 @@ config PPC_XIVE > >>bool > >>select PPC_SMP_MUXED_IPI > >>select HARDIRQS_SW_RESEND > >> - select IRQ_DOMAIN_NOMAP > >> > >> config PPC_XIVE_NATIVE > >>bool > > >
Re: [PATCH] powerpc/xive: Change IRQ domain to a tree domain
On Tue, 16 Nov 2021 14:40:22 +0100 Cédric Le Goater wrote: > Commit 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains > exclusive") introduced an IRQ_DOMAIN_FLAG_NO_MAP flag to isolate the > 'nomap' domains still in use under the powerpc arch. With this new > flag, the revmap_tree of the IRQ domain is not used anymore. This > change broke the support of shared LSIs [1] in the XIVE driver because > it was relying on a lookup in the revmap_tree to query previously > mapped interrupts. Linux now creates two distinct IRQ mappings on the > same HW IRQ which can lead to unexpected behavior in the drivers. > > The XIVE IRQ domain is not a direct mapping domain and its HW IRQ > interrupt number space is rather large : 1M/socket on POWER9 and > POWER10, change the XIVE driver to use a 'tree' domain type instead. > > [1] For instance, a linux KVM guest with virtio-rng and virtio-balloon > devices. > > Cc: Marc Zyngier > Cc: sta...@vger.kernel.org # v5.14+ > Fixes: 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains exclusive") > Signed-off-by: Cédric Le Goater > --- > Tested-by: Greg Kurz with a KVM guest + virtio-rng + virtio-balloon on a POWER9 host. > Marc, > > The Fixes tag is there because the patch in question revealed that > something was broken in XIVE. genirq is not in cause. However, I > don't know for PS3 and Cell. May be less critical for now. 
> > arch/powerpc/sysdev/xive/common.c | 3 +-- > arch/powerpc/sysdev/xive/Kconfig | 1 - > 2 files changed, 1 insertion(+), 3 deletions(-) > > diff --git a/arch/powerpc/sysdev/xive/common.c > b/arch/powerpc/sysdev/xive/common.c > index fed6fd16c8f4..9d0f0fe25598 100644 > --- a/arch/powerpc/sysdev/xive/common.c > +++ b/arch/powerpc/sysdev/xive/common.c > @@ -1536,8 +1536,7 @@ static const struct irq_domain_ops xive_irq_domain_ops > = { > > static void __init xive_init_host(struct device_node *np) > { > - xive_irq_domain = irq_domain_add_nomap(np, XIVE_MAX_IRQ, > -&xive_irq_domain_ops, NULL); > + xive_irq_domain = irq_domain_add_tree(np, &xive_irq_domain_ops, NULL); > if (WARN_ON(xive_irq_domain == NULL)) > return; > irq_set_default_host(xive_irq_domain); > diff --git a/arch/powerpc/sysdev/xive/Kconfig > b/arch/powerpc/sysdev/xive/Kconfig > index 97796c6b63f0..785c292d104b 100644 > --- a/arch/powerpc/sysdev/xive/Kconfig > +++ b/arch/powerpc/sysdev/xive/Kconfig > @@ -3,7 +3,6 @@ config PPC_XIVE > bool > select PPC_SMP_MUXED_IPI > select HARDIRQS_SW_RESEND > - select IRQ_DOMAIN_NOMAP > > config PPC_XIVE_NATIVE > bool
Re: [PATCH 0/5] KVM: Turn the vcpu array into an xarray
On 11/16/21 15:13, Juergen Gross wrote: On 05.11.21 20:20, Marc Zyngier wrote: The kvm structure is pretty large. A large portion of it is the vcpu array, which is 4kB on x86_64 and arm64 as they deal with 512 vcpu VMs. Of course, hardly anyone runs VMs this big, so this is often a net waste of memory and cache locality. A possible approach is to turn the fixed-size array into an xarray, which results in a net code deletion after a bit of cleanup. This series is on top of the current linux/master as it touches the RISC-V implementation. Only tested on arm64. Marc Zyngier (5): KVM: Move wiping of the kvm->vcpus array to common code KVM: mips: Use kvm_get_vcpu() instead of open-coded access KVM: s390: Use kvm_get_vcpu() instead of open-coded access KVM: x86: Use kvm_get_vcpu() instead of open-coded access KVM: Convert the kvm->vcpus array to a xarray arch/arm64/kvm/arm.c | 10 +- arch/mips/kvm/loongson_ipi.c | 4 ++-- arch/mips/kvm/mips.c | 23 ++- arch/powerpc/kvm/powerpc.c | 10 +- arch/riscv/kvm/vm.c | 10 +- arch/s390/kvm/kvm-s390.c | 26 ++ arch/x86/kvm/vmx/posted_intr.c | 2 +- arch/x86/kvm/x86.c | 9 + include/linux/kvm_host.h | 7 --- virt/kvm/kvm_main.c | 33 ++--- 10 files changed, 45 insertions(+), 89 deletions(-) For x86 you can add my: Tested-by: Juergen Gross Heh, unfortunately x86 is the only one that needs a change in patch 4. I'll Cc you on my version. Paolo
Re: [PATCH 4/5] KVM: x86: Use kvm_get_vcpu() instead of open-coded access
On 11/5/21 21:03, Sean Christopherson wrote: But I think even that is flawed, as APICv can be dynamically deactivated and re-activated while the VM is running, and I don't see a path that re-updates the IRTE when APICv is re-activated. So I think a more conservative check is needed, e.g. diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c index 5f81ef092bd4..6cf5b2e86118 100644 --- a/arch/x86/kvm/vmx/posted_intr.c +++ b/arch/x86/kvm/vmx/posted_intr.c @@ -272,7 +272,7 @@ int pi_update_irte(struct kvm *kvm, unsigned int host_irq, uint32_t guest_irq, if (!kvm_arch_has_assigned_device(kvm) || !irq_remapping_cap(IRQ_POSTING_CAP) || - !kvm_vcpu_apicv_active(kvm->vcpus[0])) + !irqchip_in_kernel(kvm) || !enable_apicv) return 0; idx = srcu_read_lock(&kvm->irq_srcu); What happens then if pi_pre_block is called and the IRTE denotes a posted interrupt? I might be wrong, but it seems to me that you have to change all of the occurrences this way. As soon as enable_apicv is set, you need to go through the POSTED_INTR_WAKEUP_VECTOR just in case. Paolo
Re: [PATCH 1/5] KVM: Move wiping of the kvm->vcpus array to common code
On 11/6/21 12:17, Marc Zyngier wrote: If you too believe that this is just wrong, I'm happy to drop the locking altogether. If that breaks someone's flow, they'll shout soon enough. Yes, it's not necessary. It was added in 2009 (commit 988a2cae6a3c, "KVM: Use macro to iterate over vcpus.") and it was unnecessary back then too. Paolo
Re: [PATCH v2 3/5] powerpc: Use preemption model accessors
Le 10/11/2021 à 21:24, Valentin Schneider a écrit : Per PREEMPT_DYNAMIC, checking CONFIG_PREEMPT doesn't tell you the actual preemption model of the live kernel. Use the newly-introduced accessors instead. Is that change worth it for now ? As far as I can see powerpc doesn't have DYNAMIC PREEMPT, and a lot of work needs to be done before being able to use it: - Implement GENERIC_ENTRY - Implement STATIC_CALLS (already done on PPC32, to be done on PPC64) sched_init() -> preempt_dynamic_init() happens way before IRQs are set up, so this should be fine. It looks like you are mixing up interrupts and IRQs (also known as "external interrupts"). ISI (Instruction Storage Interrupt) and DSI (Data Storage Interrupt) for instance are also interrupts. They happen every time there is a page fault, so may happen pretty early. Traps generated by WARN_ON() are also interrupts that may happen at any time. Signed-off-by: Valentin Schneider --- arch/powerpc/kernel/interrupt.c | 2 +- arch/powerpc/kernel/traps.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c index de10a2697258..c56c10b59be3 100644 --- a/arch/powerpc/kernel/interrupt.c +++ b/arch/powerpc/kernel/interrupt.c @@ -552,7 +552,7 @@ notrace unsigned long interrupt_exit_kernel_prepare(struct pt_regs *regs) /* Returning to a kernel context with local irqs enabled. */ WARN_ON_ONCE(!(regs->msr & MSR_EE)); again: - if (IS_ENABLED(CONFIG_PREEMPT)) { + if (is_preempt_full()) { I think the cost of that additional test should be analysed. Maybe it's worth not doing the test at all and checking _TIF_NEED_RESCHED every time, unless that recurrent test is changed into a jump label as suggested in patch 2. 
/* Return to preemptible kernel context */ if (unlikely(current_thread_info()->flags & _TIF_NEED_RESCHED)) { if (preempt_count() == 0) diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index aac8c0412ff9..1cb31bbdc925 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -265,7 +265,7 @@ static int __die(const char *str, struct pt_regs *regs, long err) printk("%s PAGE_SIZE=%luK%s%s%s%s%s%s %s\n", IS_ENABLED(CONFIG_CPU_LITTLE_ENDIAN) ? "LE" : "BE", PAGE_SIZE / 1024, get_mmu_str(), - IS_ENABLED(CONFIG_PREEMPT) ? " PREEMPT" : "", + is_preempt_full() ? " PREEMPT" : "", IS_ENABLED(CONFIG_SMP) ? " SMP" : "", IS_ENABLED(CONFIG_SMP) ? (" NR_CPUS=" __stringify(NR_CPUS)) : "", debug_pagealloc_enabled() ? " DEBUG_PAGEALLOC" : "", Would it be interesting as well to know that we are indeed in a DYNAMIC PREEMPT context when dying ? Christophe
[PATCH] powerpc/xive: Change IRQ domain to a tree domain
Commit 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains exclusive") introduced an IRQ_DOMAIN_FLAG_NO_MAP flag to isolate the 'nomap' domains still in use under the powerpc arch. With this new flag, the revmap_tree of the IRQ domain is not used anymore. This change broke the support of shared LSIs [1] in the XIVE driver because it was relying on a lookup in the revmap_tree to query previously mapped interrupts. Linux now creates two distinct IRQ mappings on the same HW IRQ which can lead to unexpected behavior in the drivers. The XIVE IRQ domain is not a direct mapping domain and its HW IRQ interrupt number space is rather large : 1M/socket on POWER9 and POWER10, change the XIVE driver to use a 'tree' domain type instead. [1] For instance, a linux KVM guest with virtio-rng and virtio-balloon devices. Cc: Marc Zyngier Cc: sta...@vger.kernel.org # v5.14+ Fixes: 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains exclusive") Signed-off-by: Cédric Le Goater --- Marc, The Fixes tag is there because the patch in question revealed that something was broken in XIVE. genirq is not in cause. However, I don't know for PS3 and Cell. May be less critical for now. 
arch/powerpc/sysdev/xive/common.c | 3 +-- arch/powerpc/sysdev/xive/Kconfig | 1 - 2 files changed, 1 insertion(+), 3 deletions(-) diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c index fed6fd16c8f4..9d0f0fe25598 100644 --- a/arch/powerpc/sysdev/xive/common.c +++ b/arch/powerpc/sysdev/xive/common.c @@ -1536,8 +1536,7 @@ static const struct irq_domain_ops xive_irq_domain_ops = { static void __init xive_init_host(struct device_node *np) { - xive_irq_domain = irq_domain_add_nomap(np, XIVE_MAX_IRQ, - &xive_irq_domain_ops, NULL); + xive_irq_domain = irq_domain_add_tree(np, &xive_irq_domain_ops, NULL); if (WARN_ON(xive_irq_domain == NULL)) return; irq_set_default_host(xive_irq_domain); diff --git a/arch/powerpc/sysdev/xive/Kconfig b/arch/powerpc/sysdev/xive/Kconfig index 97796c6b63f0..785c292d104b 100644 --- a/arch/powerpc/sysdev/xive/Kconfig +++ b/arch/powerpc/sysdev/xive/Kconfig @@ -3,7 +3,6 @@ config PPC_XIVE bool select PPC_SMP_MUXED_IPI select HARDIRQS_SW_RESEND - select IRQ_DOMAIN_NOMAP config PPC_XIVE_NATIVE bool -- 2.31.1
Re: [PATCH v2 2/5] preempt/dynamic: Introduce preempt mode accessors
On 10/11/2021 at 21:24, Valentin Schneider wrote: CONFIG_PREEMPT{_NONE, _VOLUNTARY} designate either: o The build-time preemption model when !PREEMPT_DYNAMIC o The default boot-time preemption model when PREEMPT_DYNAMIC IOW, using those on PREEMPT_DYNAMIC kernels is meaningless - the actual model could have been set to something else by the "preempt=foo" cmdline parameter. Introduce a set of helpers to determine the actual preemption mode used by the live kernel. Suggested-by: Marco Elver Signed-off-by: Valentin Schneider --- include/linux/sched.h | 16 ++++++++++++++++ kernel/sched/core.c | 11 +++++++++++ 2 files changed, 27 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index 5f8db54226af..0640d5622496 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2073,6 +2073,22 @@ static inline void cond_resched_rcu(void) #endif } +#ifdef CONFIG_PREEMPT_DYNAMIC + +extern bool is_preempt_none(void); +extern bool is_preempt_voluntary(void); +extern bool is_preempt_full(void); Those are trivial tests supposed to be used in fast paths. They should be static inlines in order to minimise the overhead. + +#else + +#define is_preempt_none() IS_ENABLED(CONFIG_PREEMPT_NONE) +#define is_preempt_voluntary() IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY) +#define is_preempt_full() IS_ENABLED(CONFIG_PREEMPT) Would be better to use static inlines here as well instead of macros.
+ +#endif + +#define is_preempt_rt() IS_ENABLED(CONFIG_PREEMPT_RT) + /* * Does a critical section need to be broken due to another * task waiting?: (technically does not depend on CONFIG_PREEMPTION, diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 97047aa7b6c2..9db7f77e53c3 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6638,6 +6638,17 @@ static void __init preempt_dynamic_init(void) } } +#define PREEMPT_MODE_ACCESSOR(mode) \ + bool is_preempt_##mode(void) \ + { \ + WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \ Not sure using WARN_ON is a good idea here, as it may be called very early, see comment on powerpc patch. + return preempt_dynamic_mode == preempt_dynamic_##mode; \ + } I'm not sure that's worth a macro. You only have 3 accessors, 2 lines of code each. Just define all 3 in plain text. CONFIG_PREEMPT_DYNAMIC is based on using strategies like static_calls in order to minimise the overhead. For those accessors you should use the same kind of approach and use things like jump_labels in order to not redo the test at each time and minimise overhead as much as possible. + +PREEMPT_MODE_ACCESSOR(none) +PREEMPT_MODE_ACCESSOR(voluntary) +PREEMPT_MODE_ACCESSOR(full) + #else /* !CONFIG_PREEMPT_DYNAMIC */ static inline void preempt_dynamic_init(void) { }
[PATCH 4/7] KVM: mips: Use Makefile.kvm for common files
From: David Woodhouse Signed-off-by: David Woodhouse --- arch/mips/kvm/Makefile | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/mips/kvm/Makefile b/arch/mips/kvm/Makefile index d3710959da55..21ff75bcdbc4 100644 --- a/arch/mips/kvm/Makefile +++ b/arch/mips/kvm/Makefile @@ -2,9 +2,10 @@ # Makefile for KVM support for MIPS # +include $(srctree)/virt/kvm/Makefile.kvm + ccflags-y += -Ivirt/kvm -Iarch/mips/kvm -kvm-y := $(addprefix ../../../virt/kvm/, kvm_main.o coalesced_mmio.o eventfd.o binary_stats.o) kvm-$(CONFIG_CPU_HAS_MSA) += msa.o kvm-y += mips.o emulate.o entry.o \ -- 2.31.1
[PATCH 5/7] KVM: RISC-V: Use Makefile.kvm for common files
From: David Woodhouse Signed-off-by: David Woodhouse --- arch/riscv/kvm/Makefile | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/arch/riscv/kvm/Makefile b/arch/riscv/kvm/Makefile index 30cdd1df0098..300590225348 100644 --- a/arch/riscv/kvm/Makefile +++ b/arch/riscv/kvm/Makefile @@ -5,14 +5,10 @@ ccflags-y += -I $(srctree)/$(src) -KVM := ../../../virt/kvm +include $(srctree)/virt/kvm/Makefile.kvm obj-$(CONFIG_KVM) += kvm.o -kvm-y += $(KVM)/kvm_main.o -kvm-y += $(KVM)/coalesced_mmio.o -kvm-y += $(KVM)/binary_stats.o -kvm-y += $(KVM)/eventfd.o kvm-y += main.o kvm-y += vm.o kvm-y += vmid.o -- 2.31.1
[PATCH 3/7] KVM: s390: Use Makefile.kvm for common files
From: David Woodhouse Signed-off-by: David Woodhouse --- arch/s390/kvm/Makefile | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/arch/s390/kvm/Makefile b/arch/s390/kvm/Makefile index b3aaadc60ead..e4f50453cf7f 100644 --- a/arch/s390/kvm/Makefile +++ b/arch/s390/kvm/Makefile @@ -3,13 +3,11 @@ # # Copyright IBM Corp. 2008 -KVM := ../../../virt/kvm -common-objs = $(KVM)/kvm_main.o $(KVM)/eventfd.o $(KVM)/async_pf.o \ - $(KVM)/irqchip.o $(KVM)/vfio.o $(KVM)/binary_stats.o +include $(srctree)/virt/kvm/Makefile.kvm ccflags-y := -Ivirt/kvm -Iarch/s390/kvm -kvm-objs := $(common-objs) kvm-s390.o intercept.o interrupt.o priv.o sigp.o +kvm-objs := kvm-s390.o intercept.o interrupt.o priv.o sigp.o kvm-objs += diag.o gaccess.o guestdbg.o vsie.o pv.o obj-$(CONFIG_KVM) += kvm.o -- 2.31.1
[PATCH 7/7] KVM: arm64: Use Makefile.kvm for common files
From: David Woodhouse Signed-off-by: David Woodhouse --- arch/arm64/kvm/Makefile | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile index 989bb5dad2c8..04a53f71a6b6 100644 --- a/arch/arm64/kvm/Makefile +++ b/arch/arm64/kvm/Makefile @@ -5,14 +5,12 @@ ccflags-y += -I $(srctree)/$(src) -KVM=../../../virt/kvm +include $(srctree)/virt/kvm/Makefile.kvm obj-$(CONFIG_KVM) += kvm.o obj-$(CONFIG_KVM) += hyp/ -kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \ -$(KVM)/vfio.o $(KVM)/irqchip.o $(KVM)/binary_stats.o \ -arm.o mmu.o mmio.o psci.o perf.o hypercalls.o pvtime.o \ +kvm-y += arm.o mmu.o mmio.o psci.o perf.o hypercalls.o pvtime.o \ inject_fault.o va_layout.o handle_exit.o \ guest.o debug.o reset.o sys_regs.o \ vgic-sys-reg-v3.o fpsimd.o pmu.o \ -- 2.31.1
[PATCH 2/7] KVM: Add Makefile.kvm for common files, use it for x86
From: David Woodhouse Splitting kvm_main.c out into smaller and better-organized files is slightly non-trivial when it involves editing a bunch of per-arch KVM makefiles. Provide virt/kvm/Makefile.kvm for them to include. Signed-off-by: David Woodhouse --- arch/x86/kvm/Makefile | 7 +------ virt/kvm/Makefile.kvm | 13 +++++++++++++ 2 files changed, 14 insertions(+), 6 deletions(-) create mode 100644 virt/kvm/Makefile.kvm diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index 75dfd27b6e8a..30f244b64523 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -7,12 +7,7 @@ ifeq ($(CONFIG_FRAME_POINTER),y) OBJECT_FILES_NON_STANDARD_vmenter.o := y endif -KVM := ../../../virt/kvm - -kvm-y += $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \ - $(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \ - $(KVM)/dirty_ring.o $(KVM)/binary_stats.o -kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o +include $(srctree)/virt/kvm/Makefile.kvm kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \ i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \ diff --git a/virt/kvm/Makefile.kvm b/virt/kvm/Makefile.kvm new file mode 100644 index ..ee9c310f3601 --- /dev/null +++ b/virt/kvm/Makefile.kvm @@ -0,0 +1,13 @@ +# SPDX-License-Identifier: GPL-2.0 +# +# Makefile for Kernel-based Virtual Machine module +# + +KVM ?= ../../../virt/kvm + +kvm-y := $(KVM)/kvm_main.o $(KVM)/eventfd.o $(KVM)/binary_stats.o +kvm-$(CONFIG_KVM_VFIO) += $(KVM)/vfio.o +kvm-$(CONFIG_KVM_MMIO) += $(KVM)/coalesced_mmio.o +kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o +kvm-$(CONFIG_HAVE_KVM_IRQCHIP) += $(KVM)/irqchip.o +kvm-$(CONFIG_HAVE_KVM_DIRTY_RING) += $(KVM)/dirty_ring.o -- 2.31.1
[PATCH 6/7] KVM: powerpc: Use Makefile.kvm for common files
From: David Woodhouse It's all fairly baroque but in the end, I don't think there's any reason for $(KVM)/irqchip.o to have been handled differently, as they all end up in $(kvm-y) in the end anyway, regardless of whether they get there via $(common-objs-y) and the CPU-specific object lists. The generic Makefile.kvm uses HAVE_KVM_IRQCHIP for irqchip.o instead of HAVE_KVM_IRQ_ROUTING. That change is fine (and arguably correct) because they are both set together for KVM_MPIC, or neither is set. Signed-off-by: David Woodhouse --- arch/powerpc/kvm/Makefile | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile index 583c14ef596e..245f59118413 100644 --- a/arch/powerpc/kvm/Makefile +++ b/arch/powerpc/kvm/Makefile @@ -4,11 +4,8 @@ # ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm -KVM := ../../../virt/kvm -common-objs-y = $(KVM)/kvm_main.o $(KVM)/eventfd.o $(KVM)/binary_stats.o -common-objs-$(CONFIG_KVM_VFIO) += $(KVM)/vfio.o -common-objs-$(CONFIG_KVM_MMIO) += $(KVM)/coalesced_mmio.o +include $(srctree)/virt/kvm/Makefile.kvm common-objs-y += powerpc.o emulate_loadstore.o obj-$(CONFIG_KVM_EXIT_TIMING) += timing.o @@ -125,7 +122,6 @@ kvm-book3s_32-objs := \ kvm-objs-$(CONFIG_KVM_BOOK3S_32) := $(kvm-book3s_32-objs) kvm-objs-$(CONFIG_KVM_MPIC) += mpic.o -kvm-objs-$(CONFIG_HAVE_KVM_IRQ_ROUTING) += $(KVM)/irqchip.o kvm-objs := $(kvm-objs-m) $(kvm-objs-y) -- 2.31.1
[PATCH 1/7] KVM: Introduce CONFIG_HAVE_KVM_DIRTY_RING
From: David Woodhouse I'd like to make the build include dirty_ring.c based on whether the arch wants it or not. That's a whole lot simpler if there's a config symbol instead of doing it implicitly on KVM_DIRTY_LOG_PAGE_OFFSET being set to something non-zero. Signed-off-by: David Woodhouse --- arch/x86/kvm/Kconfig | 1 + include/linux/kvm_dirty_ring.h | 8 ++++---- virt/kvm/Kconfig | 3 +++ virt/kvm/kvm_main.c | 4 ++-- 4 files changed, 10 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 619186138176..d7fa0a42ac25 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -27,6 +27,7 @@ config KVM select MMU_NOTIFIER select HAVE_KVM_IRQCHIP select HAVE_KVM_IRQFD + select HAVE_KVM_DIRTY_RING select IRQ_BYPASS_MANAGER select HAVE_KVM_IRQ_BYPASS select HAVE_KVM_IRQ_ROUTING diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h index 120e5e90fa1d..4da8d4a4140b 100644 --- a/include/linux/kvm_dirty_ring.h +++ b/include/linux/kvm_dirty_ring.h @@ -27,9 +27,9 @@ struct kvm_dirty_ring { int index; }; -#if (KVM_DIRTY_LOG_PAGE_OFFSET == 0) +#ifndef CONFIG_HAVE_KVM_DIRTY_RING /* - * If KVM_DIRTY_LOG_PAGE_OFFSET not defined, kvm_dirty_ring.o should + * If CONFIG_HAVE_KVM_DIRTY_RING not defined, kvm_dirty_ring.o should * not be included as well, so define these nop functions for the arch.
*/ static inline u32 kvm_dirty_ring_get_rsvd_entries(void) @@ -74,7 +74,7 @@ static inline bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring) return true; } -#else /* KVM_DIRTY_LOG_PAGE_OFFSET == 0 */ +#else /* CONFIG_HAVE_KVM_DIRTY_RING */ u32 kvm_dirty_ring_get_rsvd_entries(void); int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size); @@ -98,6 +98,6 @@ struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset); void kvm_dirty_ring_free(struct kvm_dirty_ring *ring); bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring); -#endif /* KVM_DIRTY_LOG_PAGE_OFFSET == 0 */ +#endif /* CONFIG_HAVE_KVM_DIRTY_RING */ #endif /* KVM_DIRTY_RING_H */ diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig index 62b39149b8c8..97cf5413ac25 100644 --- a/virt/kvm/Kconfig +++ b/virt/kvm/Kconfig @@ -13,6 +13,9 @@ config HAVE_KVM_IRQFD config HAVE_KVM_IRQ_ROUTING bool +config HAVE_KVM_DIRTY_RING + bool + config HAVE_KVM_EVENTFD bool select EVENTFD diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 9646bb9112c1..356d636e037d 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -3411,7 +3411,7 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin); static bool kvm_page_in_dirty_ring(struct kvm *kvm, unsigned long pgoff) { -#if KVM_DIRTY_LOG_PAGE_OFFSET > 0 +#ifdef CONFIG_HAVE_KVM_DIRTY_RING return (pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) && (pgoff < KVM_DIRTY_LOG_PAGE_OFFSET + kvm->dirty_ring_size / PAGE_SIZE); @@ -4114,7 +4114,7 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg) case KVM_CAP_NR_MEMSLOTS: return KVM_USER_MEM_SLOTS; case KVM_CAP_DIRTY_LOG_RING: -#if KVM_DIRTY_LOG_PAGE_OFFSET > 0 +#ifdef CONFIG_HAVE_KVM_DIRTY_RING return KVM_DIRTY_RING_MAX_ENTRIES * sizeof(struct kvm_dirty_gfn); #else return 0; -- 2.31.1
[PATCH 0/7] KVM: Add Makefile.kvm for common files
On Mon, 2021-11-15 at 20:26 +0100, Paolo Bonzini wrote: > > > Also, for the small requests: since you are at it, can you add the code > > > in a new file under virt/kvm/? > > > > Hm... only if I can make hva_to_pfn() and probably a handful of other > > things non-static? > > Yes, I think sooner or later we also want all pfn stuff in one file > (together with MMU notifiers) and all hva stuff in another; so for now > you can create virt/kvm/hva_to_pfn.h, or virt/kvm/mm.h, or whatever > color of the bikeshed you prefer. OK... let's start with this. David Woodhouse (7): KVM: Introduce CONFIG_HAVE_KVM_DIRTY_RING KVM: Add Makefile.kvm for common files, use it for x86 KVM: s390: Use Makefile.kvm for common files KVM: mips: Use Makefile.kvm for common files KVM: RISC-V: Use Makefile.kvm for common files KVM: powerpc: Use Makefile.kvm for common files KVM: arm64: Use Makefile.kvm for common files arch/arm64/kvm/Makefile | 6 ++---- arch/mips/kvm/Makefile | 3 ++- arch/powerpc/kvm/Makefile | 6 +----- arch/riscv/kvm/Makefile | 6 +----- arch/s390/kvm/Makefile | 6 ++---- arch/x86/kvm/Kconfig | 1 + arch/x86/kvm/Makefile | 7 +------ include/linux/kvm_dirty_ring.h | 8 ++++---- virt/kvm/Kconfig | 3 +++ virt/kvm/Makefile.kvm | 13 +++++++++++++ virt/kvm/kvm_main.c | 4 ++-- 11 files changed, 32 insertions(+), 31 deletions(-)
Re: [PATCH 16/39] irqdomain: Make normal and nomap irqdomains exclusive
Hello Marc, This patch is breaking the POWER9/POWER10 XIVE driver (these are not old PPC systems :) on machines sharing the same LSI HW IRQ. For instance, a Linux KVM guest with a virtio-rng and a virtio-balloon device. In that case, Linux creates two distinct IRQ mappings which can lead to some unexpected behavior. Either the irq domain translates, or it doesn't. If the driver creates a nomap domain, and yet expects some sort of translation to happen, then the driver is fundamentally broken. And even without that: how do you end up with a single HW interrupt having two mappings? A fix to go forward would be to change the XIVE IRQ domain to use a 'Tree' domain for reverse mapping and not the 'No Map' domain mapping. I will keep you updated for XIVE. I bet there is a bit more to it. From what you are saying above, something rather ungodly is happening in the XIVE code. It's making progress. This change in irq_find_mapping() is what 'breaks' XIVE: + if (irq_domain_is_nomap(domain)) { + if (hwirq < domain->revmap_size) { + data = irq_domain_get_irq_data(domain, hwirq); + if (data && data->hwirq == hwirq) + return hwirq; + } + + return 0; With the introduction of IRQ_DOMAIN_FLAG_NO_MAP, the revmap_tree lookup is skipped and the previously mapped IRQ is not found. XIVE was relying on a side effect of irq_domain_set_mapping() which is not true anymore. I guess the easiest fix for 5.14 and 5.15 (in which MSI domains were introduced) is to change the XIVE IRQ domain to a tree domain. Since the HW can handle 1M interrupts, this looks like a better choice for the driver. Thanks, C.
Re: Build regressions/improvements in v5.16-rc1
On Mon, Nov 15, 2021 at 05:12:50PM +0100, Geert Uytterhoeven wrote: > > + error: modpost: "mips_cm_is64" [drivers/pci/controller/pcie-mt7621.ko] > > undefined!: => N/A > > + error: modpost: "mips_cm_lock_other" > > [drivers/pci/controller/pcie-mt7621.ko] undefined!: => N/A > > + error: modpost: "mips_cm_unlock_other" > > [drivers/pci/controller/pcie-mt7621.ko] undefined!: => N/A > > + error: modpost: "mips_cpc_base" [drivers/pci/controller/pcie-mt7621.ko] > > undefined!: => N/A > > + error: modpost: "mips_gcr_base" [drivers/pci/controller/pcie-mt7621.ko] > > undefined!: => N/A > > mips-allmodconfig there is a patchset fixing this https://lore.kernel.org/all/2025070809.15529-1-sergio.paracuel...@gmail.com/ > > 3 warning regressions: > > + : warning: #warning syscall futex_waitv not implemented [-Wcpp]: > > => 1559:2 > > powerpc, m68k, mips, s390, parisc (and probably more) I've queued a patch to fix this for mips. Thomas. -- Crap can work. Given enough thrust pigs will fly, but it's not necessarily a good idea.[ RFC1925, 2.3 ]
Re: [PATCH 0/3] KEXEC_SIG with appended signature
On Mon, Nov 15, 2021 at 06:53:53PM -0500, Nayna wrote: > > On 11/12/21 03:30, Michal Suchánek wrote: > > Hello, > > > > On Thu, Nov 11, 2021 at 05:26:41PM -0500, Nayna wrote: > > > On 11/8/21 07:05, Michal Suchánek wrote: > > > > Hello, > > > > > > > > The other part is that distributions apply 'lockdown' patches that > > > > change > > > > the security policy depending on secure boot status which were rejected > > > > by upstream which only hook into the _SIG options, and not into the IMA_ > > > > options. Of course, I expect this to change when the IMA options are > > > > universally available across architectures and the support picked up by > > > > distributions. > > > > > > > > Which brings the third point: IMA features vary across architectures, > > > > and KEXEC_SIG is more common than IMA_KEXEC. > > > > > > > > config/arm64/default:CONFIG_HAVE_IMA_KEXEC=y > > > > config/ppc64le/default:CONFIG_HAVE_IMA_KEXEC=y > > > > > > > > config/arm64/default:CONFIG_KEXEC_SIG=y > > > > config/s390x/default:CONFIG_KEXEC_SIG=y > > > > config/x86_64/default:CONFIG_KEXEC_SIG=y > > > > > > > > KEXEC_SIG makes it much easier to get uniform features across > > > > architectures. > > > Architectures use KEXEC_SIG vs IMA_KEXEC based on their requirement. > > > IMA_KEXEC is for the kernel images signed using sign-file (appended > > > signatures, not PECOFF), provides measurement along with verification, and > > That's certainly not the case. S390 uses appended signatures with > > KEXEC_SIG, arm64 uses PECOFF with both KEXEC_SIG and IMA_KEXEC. > > Yes, S390 uses appended signature, but they also do not support > measurements. > > On the other hand for arm64/x86, PECOFF works only with KEXEC_SIG. Look at > the KEXEC_IMAGE_VERIFY_SIG config dependencies in arch/arm64/Kconfig and > KEXEC_BZIMAGE_VERIFY_SIG config dependencies in arch/x86/Kconfig. 
Now, if > KEXEC_SIG is not enabled, then IMA appraisal policies are enforced if secure > boot is enabled, refer to security/integrity/ima_efi.c. IMA would fail > verification if kernel is not signed with module sig appended signatures or > signature verification fails. > > In short, IMA is used to enforce the existence of a policy if secure boot is > enabled. If they don't support module sig appended signatures, by definition > it fails. Thus PECOFF doesn't work with both KEXEC_SIG and IMA_KEXEC, but > only with KEXEC_SIG. Then IMA_KEXEC is a no-go. It is not supported on all architectures and it in principle cannot be supported because it does not support PECOFF which is needed to boot the kernel on EFI platforms. To get feature parity across architectures KEXEC_SIG is required. > > > > is tied to secureboot state of the system at boot time. > > In distributions it's also the case with KEXEC_SIG, it's only upstream > > where this is different. I don't know why Linux upstream has rejected > > this support for KEXEC_SIG. > > > > Anyway, sounds like the difference is that IMA provides measurement but > > if you don't use it it does not make any difference except more complex > > code. > I am unsure what you mean by "complex code" here. Can you please > elaborate? IMA policies support for secureboot already exists and can be > used as it is without adding any extra work as in > arch/powerpc/kernel/ima_arch.c. The code exists but using it to replace KEXEC_SIG also requires understanding the code and the implications of using it. At a glance the IMA codebase is much bigger and more convoluted compared to KEXEC_SIG and MODULE_SIG. Thanks Michal
Re: [PATCH v2 3/3] soc: fsl: Replace kernel.h with the necessary inclusions
On Mon, Nov 15, 2021 at 10:24:36PM +, Leo Li wrote: > > From: Andy Shevchenko > > Sent: Monday, November 15, 2021 5:30 AM > > On Wed, Nov 10, 2021 at 12:59:52PM +0200, Andy Shevchenko wrote: ... > > > v2: updated Cc list based on previous changes to MAINTAINERS > > > > Any comments on this, please? > > > > I really want to decrease the amount of kernel.h usage in the common headers. > > So others won't copy'n'paste a bad example. > > There seems to be no problem with the patch although I didn't get time to > really compile with it applied. > > Will pick them up later after build test. Thank you! Note, it has two fixes against MAINTAINERS which may be sent, I believe, sooner than later to Linus. -- With Best Regards, Andy Shevchenko
Re: [RFC PATCH 0/3] Use pageblock_order for cma and alloc_contig_range alignment.
On 15.11.21 20:37, Zi Yan wrote: > From: Zi Yan > > Hi David, Hi, thanks for looking into this. > > You suggested making alloc_contig_range() deal with pageblock_order instead > of > MAX_ORDER - 1 and get rid of MAX_ORDER - 1 dependency in virtio_mem[1]. This > patchset is my attempt to achieve that. Please take a look and let me know if > I am doing it correctly or not. > > From my understanding, cma required alignment of > max(MAX_ORDER - 1, pageblock_order), because when MIGRATE_CMA was introduced, > __free_one_page() did not prevent merging two different pageblocks, when > MAX_ORDER - 1 > pageblock_order. But the current __free_one_page() implementation > does prevent that. It should be OK to just align cma to pageblock_order. > alloc_contig_range() relies on MIGRATE_CMA to get free pages, so it can use > pageblock_order as alignment too. I wonder if that's sufficient. Especially the outer_start logic in alloc_contig_range() might be problematic. There are some ugly corner cases with free pages/allocations spanning multiple pageblocks while we only isolated a single pageblock. Regarding CMA, we have to keep the following cases working: a) Different pageblock types (MIGRATE_CMA and !MIGRATE_CMA) in MAX_ORDER - 1 page: [ MAX_ORDER - 1 ] [ pageblock 0 | pageblock 1] Assume either pageblock 0 is MIGRATE_CMA or pageblock 1 is MIGRATE_CMA, but not both. We have to make sure alloc_contig_range() keeps working correctly. This should be the case even with your change, as we won't be merging pages across differing migratetypes. b) Migrating/freeing a MAX_ORDER - 1 page while partially isolated: [ MAX_ORDER - 1 ] [ pageblock 0 | pageblock 1] Assume both are MIGRATE_CMA. Assume we want to either allocate from pageblock 0 or pageblock 1. Especially, assume we want to allocate from pageblock 1. While we would isolate pageblock 1, we wouldn't isolate pageblock 0.
What happens if we either have a free page spanning the MAX_ORDER - 1 range already OR if we have to migrate a MAX_ORDER - 1 page, resulting in a free MAX_ORDER - 1 page of which only the second pageblock is isolated? We would end up essentially freeing a page that has mixed pageblocks, essentially placing it in !MIGRATE_ISOLATE free lists ... I might be wrong but I have the feeling that this would be problematic. c) Concurrent allocations: [ MAX_ORDER - 1 ] [ pageblock 0 | pageblock 1] Assume b) but we have two concurrent CMA allocations to pageblock 0 and pageblock 1, which would now be possible as start_isolate_page_range() isolate would succeed on both. Regarding virtio-mem, we care about the following cases: a) Allocating parts from completely movable MAX_ORDER - 1 page: [ MAX_ORDER - 1 ] [ pageblock 0 | pageblock 1] Assume pageblock 0 and pageblock 1 are either free or contain only movable pages. Assume we allocated pageblock 0. We have to make sure we can allocate pageblock 1. The other way around, assume we allocated pageblock 1, we have to make sure we can allocate pageblock 0. Free pages spanning both pageblocks might be problematic. b) Allocate parts of partially movable MAX_ORDER - 1 page: [ MAX_ORDER - 1 ] [ pageblock 0 | pageblock 1] Assume pageblock 0 contains unmovable data but pageblock 1 not: we have to make sure we can allocate pageblock 1. Similarly, assume pageblock 1 contains unmovable data but pageblock 0 not: we have to make sure we can allocate pageblock 0. has_unmovable_pages() might allow for that. But, we want to fail early in case we want to allocate a single pageblock but it contains unmovable data. This could be either directly or indirectly. If we have an unmovable (compound) MAX_ORDER - 1 page and we'd try isolating pageblock 1, has_unmovable_pages() would always return "false" because we'd simply be skipping over any tail pages, and not detect the un-movability.
c) Migrating/freeing a MAX_ORDER - 1 page while partially isolated: Same concern as for CMA b). So the biggest concern I have is dealing with migrating/freeing > pageblock_order pages while only having isolated a single pageblock. > > In terms of virtio_mem, if I understand correctly, it relies on > alloc_contig_range() to obtain contiguous free pages and offlines them to > reduce > guest memory size. As the result of alloc_contig_range() alignment change, > virtio_mem should be able to just align PFNs to pageblock_order. For virtio-mem it will most probably be desirable to first try allocating the MAX_ORDER - 1 range if possible and then fall back to pageblock_order. But that's an additional change on top in virtio-mem code. My take to teach alloc_contig_range() to properly handle this would be the following: a) Convert MIGRATE_ISOLATE into a separate pageblock flag We would want to convert MIGRATE_ISOLATE into a separate pageblock flag, such that when we isolate a page block we preserve the original migratetype. While start_isolate_page_range()
Re: [PATCH 16/39] irqdomain: Make normal and nomap irqdomains exclusive
On Mon, 15 Nov 2021 19:05:17 +, Cédric Le Goater wrote: > > Hello Mark, s/k/c/ > > On 5/20/21 18:37, Marc Zyngier wrote: > > Direct mappings are completely exclusive of normal mappings, meaning > > that we can refactor the code slightly so that we can get rid of > > the revmap_direct_max_irq field and use the revmap_size field > > instead, reducing the size of the irqdomain structure. > > > > Signed-off-by: Marc Zyngier > > > This patch is breaking the POWER9/POWER10 XIVE driver (these are not > old PPC systems :) on machines sharing the same LSI HW IRQ. For instance, > a Linux KVM guest with a virtio-rng and a virtio-balloon device. In that > case, Linux creates two distinct IRQ mappings which can lead to some > unexpected behavior. Either the irq domain translates, or it doesn't. If the driver creates a nomap domain, and yet expects some sort of translation to happen, then the driver is fundamentally broken. And even without that: how do you end up with a single HW interrupt having two mappings? > A fix to go forward would be to change the XIVE IRQ domain to use a > 'Tree' domain for reverse mapping and not the 'No Map' domain mapping. > I will keep you updated for XIVE. I bet there is a bit more to it. From what you are saying above, something rather ungodly is happening in the XIVE code. M. -- Without deviation from the norm, progress is not possible.