Re: [PATCH 3/7] KVM: s390: Use Makefile.kvm for common files

2021-11-16 Thread Christian Borntraeger




Am 16.11.21 um 12:50 schrieb David Woodhouse:

From: David Woodhouse 

Signed-off-by: David Woodhouse 


Looks good.
Reviewed-by: Christian Borntraeger 


---
  arch/s390/kvm/Makefile | 6 ++----
  1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/s390/kvm/Makefile b/arch/s390/kvm/Makefile
index b3aaadc60ead..e4f50453cf7f 100644
--- a/arch/s390/kvm/Makefile
+++ b/arch/s390/kvm/Makefile
@@ -3,13 +3,11 @@
  #
  # Copyright IBM Corp. 2008
  
-KVM := ../../../virt/kvm

-common-objs = $(KVM)/kvm_main.o $(KVM)/eventfd.o  $(KVM)/async_pf.o \
- $(KVM)/irqchip.o $(KVM)/vfio.o $(KVM)/binary_stats.o
+include $(srctree)/virt/kvm/Makefile.kvm
  
  ccflags-y := -Ivirt/kvm -Iarch/s390/kvm
  
-kvm-objs := $(common-objs) kvm-s390.o intercept.o interrupt.o priv.o sigp.o

+kvm-objs := kvm-s390.o intercept.o interrupt.o priv.o sigp.o
  kvm-objs += diag.o gaccess.o guestdbg.o vsie.o pv.o
  
  obj-$(CONFIG_KVM) += kvm.o




Re: [RESEND PATCH v5 0/4] Add perf interface to expose nvdimm

2021-11-16 Thread kajoljain



On 11/16/21 8:29 PM, LEROY Christophe wrote:
> Hi
> 
> On 16/11/2021 at 05:49, Kajol Jain wrote:
>> Patchset adds performance stats reporting support for nvdimm.
>> Added interface includes support for pmu register/unregister
>> functions. A structure is added called nvdimm_pmu to be used for
>> adding arch/platform specific data such as cpumask, nvdimm device
>> pointer and pmu event functions like event_init/add/read/del.
>> User could use the standard perf tool to access perf events
>> exposed via pmu.
>>
>> Interface also defines supported event list, config fields for the
>> event attributes and their corresponding bit values which are exported
>> via sysfs. Patch 3 exposes IBM pseries platform nmem* device
>> performance stats using this interface.
> 
> Are you resending your v5 series? Is there any news since your submission 
> last month that's awaiting in patchwork here at 
> https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=264422 ?

Hi Christophe,
  There are no code changes since the last v5 version. Since I am
looking for reviews, I have posted this patchset again with the RESEND tag.

Thanks,
Kajol Jain

> 
> Christophe
> 
> 
>>
>> Result from power9 pseries lpar with 2 nvdimm device:
>>
>> Ex: List all event by perf list
>>
>> command:# perf list nmem
>>
>>nmem0/cache_rh_cnt/[Kernel PMU event]
>>nmem0/cache_wh_cnt/[Kernel PMU event]
>>nmem0/cri_res_util/[Kernel PMU event]
>>nmem0/ctl_res_cnt/ [Kernel PMU event]
>>nmem0/ctl_res_tm/  [Kernel PMU event]
>>nmem0/fast_w_cnt/  [Kernel PMU event]
>>nmem0/host_l_cnt/  [Kernel PMU event]
>>nmem0/host_l_dur/  [Kernel PMU event]
>>nmem0/host_s_cnt/  [Kernel PMU event]
>>nmem0/host_s_dur/  [Kernel PMU event]
>>nmem0/med_r_cnt/   [Kernel PMU event]
>>nmem0/med_r_dur/   [Kernel PMU event]
>>nmem0/med_w_cnt/   [Kernel PMU event]
>>nmem0/med_w_dur/   [Kernel PMU event]
>>nmem0/mem_life/[Kernel PMU event]
>>nmem0/poweron_secs/[Kernel PMU event]
>>...
>>nmem1/mem_life/[Kernel PMU event]
>>nmem1/poweron_secs/[Kernel PMU event]
>>
>> Patch1:
>>  Introduces the nvdimm_pmu structure
>> Patch2:
>>  Adds common interface to add arch/platform specific data
>>  includes nvdimm device pointer, pmu data along with
>>  pmu event functions. It also defines supported event list
>>  and adds attribute groups for format, events and cpumask.
>>  It also adds code for cpu hotplug support.
>> Patch3:
>>  Add code in arch/powerpc/platform/pseries/papr_scm.c to expose
>>  nmem* pmu. It fills in the nvdimm_pmu structure with pmu name,
>>  capabilities, cpumask and event functions and then registers
>>  the pmu by adding callbacks to register_nvdimm_pmu.
>> Patch4:
>>  Sysfs documentation patch
>>
>> Changelog
>> ---
>> v4 -> v5:
>> - Remove multiple variables defined in the nvdimm_pmu structure, including
>>    the name and pmu functions (event_init/add/del/read), as they were only
>>    copied into the pmu variable again. Now we do this step directly in
>>    arch-specific code, as suggested by Dan Williams.
>>
>> - Remove attribute group field from nvdimm pmu structure and
>>defined these attribute groups in common interface which
>>includes format, event list along with cpumask as suggested by
>>Dan Williams.
>>    Since we added static definitions for the attribute groups needed in
>>    the common interface, the corresponding code is removed from papr.
>>
>> - Add nvdimm pmu event list with event codes in the common interface.
>>
>> - Remove Acked-by/Reviewed-by/Tested-by tags as code is refactored
>>to handle review comments from Dan.
>>
>> - Make nvdimm_pmu_free_hotplug_memory function static as reported
>>by kernel test robot, also add corresponding Reported-by tag.
>>
>> - Link to the patchset v4: https://lkml.org/lkml/2021/9/3/45
>>
>> v3 -> v4
>> - Rebase code on top of current papr_scm code without any logical
>>changes.
>>
>> - Added Acked-by tag from Peter Zijlstra and Reviewed by tag
>>from Madhavan Srinivasan.
>>
>> - Link to the patchset v3: https://lkml.org/lkml/2021/6/17/605
>>
>> v2 -> v3
>> - Added Tested-by tag.
>>
>> - Fix nvdimm mailing list in the ABI Documentation.
>>
>> - Link to the patchset v2: https://lkml.org/lkml/2021/6/14/25
>>
>> v1 -> v2
>> - Fix hotplug code by adding pmu migration call
>>   
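The interface described above (an nvdimm_pmu structure carrying a device pointer and event_init/add/read/del callbacks, registered through register_nvdimm_pmu) can be sketched as a rough user-space model. Every field, type, and helper name below is an illustrative assumption for this sketch; the real definitions live in the kernel patches under discussion.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a perf event descriptor; purely illustrative. */
struct nvd_event { unsigned long long config, count; };

/* Model of struct nvdimm_pmu: platform data plus event callbacks. */
struct nvdimm_pmu {
    const char *name;                       /* e.g. "nmem0" */
    void *dev;                              /* nvdimm device pointer */
    int  (*event_init)(struct nvd_event *);
    int  (*add)(struct nvd_event *);
    void (*read)(struct nvd_event *);
    void (*del)(struct nvd_event *);
};

static struct nvdimm_pmu *registered_pmu;

/* Model of register_nvdimm_pmu(): validate callbacks and record. */
int register_nvdimm_pmu(struct nvdimm_pmu *pmu)
{
    if (!pmu || !pmu->event_init || !pmu->read)
        return -1;
    registered_pmu = pmu;
    return 0;
}

/* A platform driver (papr_scm in patch 3) would fill in the callbacks: */
static int  demo_init(struct nvd_event *ev) { ev->count = 0; return 0; }
static int  demo_add (struct nvd_event *ev) { (void)ev; return 0; }
static void demo_read(struct nvd_event *ev) { ev->count += 1; }
static void demo_del (struct nvd_event *ev) { (void)ev; }

static struct nvdimm_pmu demo_pmu = {
    .name = "nmem0", .event_init = demo_init, .add = demo_add,
    .read = demo_read, .del = demo_del,
};

int demo_register(void) { return register_nvdimm_pmu(&demo_pmu); }
```

The point of this shape, as the changelog notes, is that the arch code fills in the callbacks itself rather than the common layer copying them around.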

Re: [PATCH v2 1/2] powerpc/rtas: rtas_busy_delay() improvements

2021-11-16 Thread Nathan Lynch
Nathan Lynch  writes:
>
> Signed-off-by: Nathan Lynch 
> ---
>
> Notes:
> This could be considered a sequel to:
> 
> 
> https://lore.kernel.org/linuxppc-dev/20210504030358.1715034-1-nath...@linux.ibm.com/
> 
> which tried to address both the suboptimal delay behavior as well as naming
> issues. The present change achieves the performance improvement and leaves
> the matter of naming for another time.
> 
> I've incorporated Alexey's feedback from that series to avoid division 
> when
> calculating the arguments to usleep_range().

Oops, this part is redundant with information in the cover. Git should
discard it if/when applied, though.


Re: [PATCH] ppc64/fadump: fix inaccurate CPU state info in vmcore generated with panic

2021-11-16 Thread Hari Bathini




On 11/11/21 6:50 pm, Nicholas Piggin wrote:

Excerpts from Hari Bathini's message of November 11, 2021 10:06 pm:



On 11/11/21 11:44 am, Michael Ellerman wrote:

Hari Bathini  writes:

In panic path, fadump is triggered via a panic notifier function.
Before calling panic notifier functions, smp_send_stop() gets called,
which stops all CPUs except the panic'ing CPU. Commit 8389b37dffdc
("powerpc: stop_this_cpu: remove the cpu from the online map.") and
again commit bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
started marking CPUs as offline while stopping them. So, if a kernel
has either of the above commits, vmcore captured with fadump via panic
path would show only the panic'ing CPU as online. Sample output of
crash-utility with such vmcore:

# crash vmlinux vmcore
...
  KERNEL: vmlinux
DUMPFILE: vmcore  [PARTIAL DUMP]
CPUS: 1
DATE: Wed Nov 10 09:56:34 EST 2021
  UPTIME: 00:00:42
LOAD AVERAGE: 2.27, 0.69, 0.24
   TASKS: 183
NODENAME: X
 RELEASE: 5.15.0+
 VERSION: #974 SMP Wed Nov 10 04:18:19 CST 2021
 MACHINE: ppc64le  (2500 Mhz)
  MEMORY: 8 GB
   PANIC: "Kernel panic - not syncing: sysrq triggered crash"
 PID: 3394
 COMMAND: "bash"
TASK: c000150a5f80  [THREAD_INFO: c000150a5f80]
 CPU: 1
   STATE: TASK_RUNNING (PANIC)

crash> p -x __cpu_online_mask
__cpu_online_mask = $1 = {
  bits = {0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
}
crash>
crash>
crash> p -x __cpu_active_mask
__cpu_active_mask = $2 = {
  bits = {0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
}
crash>
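For reference, the cpumask words in the crash output above decode as "bit n set means CPU n is in the mask": 0x2 shows only CPU 1 online, while 0xff shows CPUs 0-7 still active. A minimal standalone check:

```c
#include <assert.h>

/* Count the CPUs present in a cpumask word: bit n set => CPU n. */
static int cpus_in_mask(unsigned long bits)
{
    int n = 0;
    for (; bits; bits >>= 1)
        n += bits & 1;
    return n;
}
```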

While this has been the case since fadump was introduced, the issue
was not identified for two probable reasons:

- In general, the bulk of the vmcores analyzed were from crash
  due to exception.

- The above did change since commit 8341f2f222d7 ("sysrq: Use
  panic() to force a crash") started using panic() instead of
  dereferencing a NULL pointer to force a kernel crash. But then
  commit de6e5d38417e ("powerpc: smp_send_stop do not offline
  stopped CPUs") stopped marking CPUs as offline till kernel
  commit bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
  reverted that change.

To avoid vmcore from showing only one CPU as online in panic path,
skip marking non panic'ing CPUs as offline while stopping them
during fadump crash.


Is this really worth the added complexity/bug surface?

Why does it matter if the vmcore shows only one CPU online?


We lose out on backtrace/register data of non-crashing CPUs as they
are explicitly marked offline.

Actually, the state of CPU resources is explicitly changed after the
panic, even though the aim of vmcore is to capture the system state at the
time of panic...

Alternatively, how about moving crash_fadump() call from panic notifier
into panic() function explicitly, just like __crash_kexec() - before the
smp_send_stop() call, so as to remove dependency with smp_send_stop()
implementation altogether...


Does the crash dump code snapshot the CPUs with send_debugger_break? I
can't remember the exact flow. But if it sends NMI IPIs to all other
CPUs then maybe it could be moved before smp_send_stop, good idea.


Except for the crashing CPU, the register data snapshot for all other CPUs
is collected by firmware on the ibm,os-term RTAS call.



The comment talks about printk_safe_flush_on_panic(), and this change
would presumably break that. Except that printk_safe_flush_on_panic() no
longer exists.

So do we need to set the CPU online here at all?

ie. could we revert bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
now that printk_safe_flush_on_panic() no longer exists?


Yeah, that sounds like the logical thing to do, but I guess Nick would be
in a better position to answer this..


Maybe we could look at reverting it; it would be nice. But I think it
would be good to consider moving crash_fadump as well. There is at

OK, I will try moving crash_fadump() instead and post the patch..

Thanks
Hari


[PATCH v2 2/2] powerpc/rtas: rtas_busy_delay_time() kernel-doc

2021-11-16 Thread Nathan Lynch
Provide API documentation for rtas_busy_delay_time(), explaining why we
return the same value for 9900 and -2.

Signed-off-by: Nathan Lynch 
---
 arch/powerpc/kernel/rtas.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index d686834fe7f5..4cfe9f93a9cd 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -492,8 +492,25 @@ int rtas_call(int token, int nargs, int nret, int *outputs, ...)
 }
 EXPORT_SYMBOL(rtas_call);
 
-/* For RTAS_BUSY (-2), delay for 1 millisecond.  For an extended busy status
- * code of 990n, perform the hinted delay of 10^n (last digit) milliseconds.
+/**
+ * rtas_busy_delay_time() - From an RTAS status value, calculate the
+ *  suggested delay time in milliseconds.
+ *
+ * @status: a value returned from rtas_call() or similar APIs which return
+ *  the status of a RTAS function call.
+ *
+ * Context: Any context.
+ *
+ * Return:
+ * * 100000 - If @status is 9905.
+ * * 10000  - If @status is 9904.
+ * * 1000   - If @status is 9903.
+ * * 100    - If @status is 9902.
+ * * 10     - If @status is 9901.
+ * * 1      - If @status is either 9900 or -2. This is "wrong" for -2, but
+ *            some callers depend on this behavior, and the worst outcome
+ *            is that they will delay for longer than necessary.
+ * * 0      - If @status is not a busy or extended delay value.
  */
 unsigned int rtas_busy_delay_time(int status)
 {
-- 
2.31.1
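As the removed comment states, an extended busy status of 990n hints at a delay of 10^n milliseconds, and RTAS_BUSY (-2) is treated like 9900 here. A standalone model of that mapping (for illustration only, not the kernel implementation):

```c
#include <assert.h>

/* Sketch: map an RTAS status to the suggested delay in milliseconds.
 * 990n -> 10^n ms; RTAS_BUSY (-2) is treated like 9900; anything else
 * is not a busy/extended status and yields 0. */
unsigned int busy_delay_time(int status)
{
    unsigned int ms;
    int n;

    if (status == -2)                   /* RTAS_BUSY */
        return 1;
    if (status < 9900 || status > 9905)
        return 0;                       /* not busy/extended delay */
    for (ms = 1, n = status - 9900; n > 0; n--)
        ms *= 10;                       /* 990n -> 10^n ms */
    return ms;
}
```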



[PATCH v2 0/2] powerpc/rtas: improved busy and extended delay status handling

2021-11-16 Thread Nathan Lynch
This can be considered a successor to:

https://lore.kernel.org/linuxppc-dev/20210504030358.1715034-1-nath...@linux.ibm.com/

which tried to address both the suboptimal delay behavior as well as API
issues. This version achieves the performance improvements and leaves major
API changes for another time.

Changes since v1:
* Drop major API changes.
* Avoid division when calculating the arguments to usleep_range() (Alexey).
* Improve kernel-doc for rtas_busy_delay(), rtas_busy_delay_time().

Nathan Lynch (2):
  powerpc/rtas: rtas_busy_delay() improvements
  powerpc/rtas: rtas_busy_delay_time() kernel-doc

 arch/powerpc/include/asm/rtas.h |  2 +-
 arch/powerpc/kernel/rtas.c  | 95 +
 2 files changed, 87 insertions(+), 10 deletions(-)

-- 
2.31.1



[PATCH v2 1/2] powerpc/rtas: rtas_busy_delay() improvements

2021-11-16 Thread Nathan Lynch
Generally RTAS cannot block, and in PAPR it is required to return control
to the OS within a few tens of microseconds. In order to support operations
which may take longer to complete, many RTAS primitives can return
intermediate -2 ("busy") or 990x ("extended delay") values, which indicate
that the OS should reattempt the same call with the same arguments at some
point in the future.

Current versions of PAPR are less than clear about this, but the intended
meanings of these values in more detail are:

RTAS_BUSY (-2): RTAS has suspended a potentially long-running operation in
order to meet its latency obligation and give the OS the opportunity to
perform other work. RTAS can resume making progress as soon as the OS
reattempts the call.

RTAS_EXTENDED_DELAY_{MIN...MAX} (9900-9905): RTAS must wait for an external
event to occur or for internal contention to resolve before it can complete
the requested operation. The value encodes a non-binding hint as to roughly
how long the OS should wait before calling again, but the OS is allowed to
reattempt the call sooner or even immediately.

Linux of course must take its own CPU scheduling obligations into account
when handling these statuses; e.g. a task which receives an RTAS_BUSY
status should check whether to reschedule before it attempts the RTAS call
again to avoid starving other tasks.

rtas_busy_delay() is a helper function that "consumes" a busy or extended
delay status. Common usage:

int rc;

do {
rc = rtas_call(rtas_token("some-function"), ...);
} while (rtas_busy_delay(rc));

/* convert rc to Linux error value, etc */

If rc is a busy or extended delay status, the caller can rely on
rtas_busy_delay() to perform an appropriate sleep or reschedule and return
nonzero. Other statuses are handled normally by the caller.

The current implementation of rtas_busy_delay() both oversleeps and
overuses the CPU:

*  It performs msleep() for all 990x statuses and even when no delay is
   suggested (-2), but this is understood to actually sleep for two jiffies
   minimum in practice (20ms with HZ=100). 9900 (1ms) and 9901 (10ms)
   appear to be the most common extended delay statuses, and the
   oversleeping measurably lengthens DLPAR operations, which perform
   many RTAS calls.

*  It does not sleep on 990x unless need_resched() is true, causing code
   like the loop above to needlessly retry, wasting CPU time.

Alter the logic to align better with the intended meanings:

*  When passed RTAS_BUSY, perform cond_resched() and return without
   sleeping. The caller should reattempt immediately.

*  Always sleep when passed an extended delay status, using usleep_range()
   for precise shorter sleeps. Limit the sleep time to one second even
   though there are higher architected values.

Change rtas_busy_delay()'s return type to bool to better reflect its usage,
and add kernel-doc.
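A rough user-space model of the reworked logic, with cond_resched() and usleep_range() stubbed out purely so the decision path can run anywhere, might look like this sketch (not the kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static int resched_calls, sleep_calls;

/* Stubs standing in for the kernel primitives. */
static void cond_resched(void) { resched_calls++; }
static void usleep_range(unsigned long lo, unsigned long hi)
{
    sleep_calls++;
    printf("sleep %lu..%lu us\n", lo, hi);
}

/* Model of the reworked rtas_busy_delay(): returns true if the status
 * was consumed (caller should retry), false otherwise. */
bool busy_delay(int status)
{
    unsigned int ms;
    int n;

    if (status == -2) {                 /* RTAS_BUSY: no sleep, */
        cond_resched();                 /* just yield if needed */
        return true;
    }
    if (status < 9900 || status > 9905)
        return false;                   /* caller handles status */
    for (ms = 1, n = status - 9900; n > 0; n--)
        ms *= 10;                       /* 990n -> 10^n ms */
    if (ms > 1000)
        ms = 1000;                      /* cap sleep at one second */
    usleep_range(ms * 1000UL, ms * 1000UL + 1000);
    return true;
}
```

The usage pattern stays the do/while loop shown earlier; only the sleep behavior inside the helper changes.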

rtas_busy_delay_time() is unchanged, even though it "incorrectly" returns 1
for RTAS_BUSY. There are users of that API with open-coded delay loops in
sensitive contexts that will have to be taken on an individual basis.

Brief results for addition and removal of 5GB memory on a small P9 PowerVM
partition follow. Load was generated with stress-ng --cpu N. For add,
elapsed time is greatly reduced without significant change in the number of
RTAS calls or time spent on CPU. For remove, elapsed time is modestly
reduced, with significant reductions in RTAS calls and time spent on CPU.

With no competing workload (- before, + after):

  Performance counter stats for 'bash -c echo "memory add count 20" > 
/sys/kernel/dlpar' (10 runs):

- 1,935  probe:rtas_call   #0.003 M/sec 
   ( +-  0.22% )
-609.99 msec task-clock#0.183 CPUs utilized 
   ( +-  0.19% )
+ 1,956  probe:rtas_call   #0.003 M/sec 
   ( +-  0.17% )
+618.56 msec task-clock#0.278 CPUs utilized 
   ( +-  0.11% )

-3.3322 +- 0.0670 seconds time elapsed  ( +-  2.01% )
+2. +- 0.0416 seconds time elapsed  ( +-  1.87% )

  Performance counter stats for 'bash -c echo "memory remove count 20" > 
/sys/kernel/dlpar' (10 runs):

- 6,224  probe:rtas_call   #0.008 M/sec 
   ( +-  2.57% )
-750.36 msec task-clock#0.190 CPUs utilized 
   ( +-  2.01% )
+   843  probe:rtas_call   #0.003 M/sec 
   ( +-  0.12% )
+250.66 msec task-clock#0.068 CPUs utilized 
   ( +-  0.17% )

-3.9394 +- 0.0890 seconds time elapsed  ( +-  2.26% )
+ 3.678 +- 0.113 seconds time elapsed  ( +-  3.07% )

With all CPUs 100% busy (- before, + after):

  Performance counter stats for 'bash -c echo "memory add count 20" > 
/sys/kernel/dlpar' (10 runs):

- 2,979  

Re: [PATCH 12/13] sysctl: add helper to register empty subdir

2021-11-16 Thread Luis Chamberlain
On Fri, May 29, 2020 at 08:03:02AM -0500, Eric W. Biederman wrote:
> Luis Chamberlain  writes:
> 
> > The way to create a subdirectory from the base set of directories
> > is a bit obscure, so provide a helper which makes this clear, and
> > also helps remove boiler plate code required to do this work.
> 
> I agree that calling:
> register_sysctl("fs/binfmt_misc", sysctl_mount_point)
> is a bit obscure but if you are going to make a wrapper
> please make it the trivial one liner above.
> 
> Say something that looks like:
>   struct sysctl_header *register_sysctl_mount_point(const char *path)
> {
>   return register_sysctl(path, sysctl_mount_point);
> }
> 
> And yes please talk about a mount point and not an empty dir, as these
> are permanently empty directories to serve as mount points.  There are
> some subtle but important permission checks this allows in the case of
> unprivileged mounts.
> 
> Further code like this belong in proc_sysctl.c next to all of the code
> it is related to so that it is easier to see how to refactor the code if
> necessary.

Alrighty, it's been a while since this kernel/sysctl.c kitchen sink
cleanup... so it's time to respin this now that the merge window is
open.  I already rebased patches, addressed all input and now just
waiting to fix any compilation errors.  I'm going to split the patches
up into really small sets so as to ensure we get this through, because
getting this in otherwise is going to be hard.

I'd appreciate folks' review once the patches start going out. I think
a hard part will be deciding what tree this should go through.

  Luis


Re: [RFC PATCH 0/3] Use pageblock_order for cma and alloc_contig_range alignment.

2021-11-16 Thread Zi Yan
On 16 Nov 2021, at 3:58, David Hildenbrand wrote:

> On 15.11.21 20:37, Zi Yan wrote:
>> From: Zi Yan 
>>
>> Hi David,
>
> Hi,
>
> thanks for looking into this.
>
>>
>> You suggested to make alloc_contig_range() deal with pageblock_order instead 
>> of
>> MAX_ORDER - 1 and get rid of MAX_ORDER - 1 dependency in virtio_mem[1]. This
>> patchset is my attempt to achieve that. Please take a look and let me know if
>> I am doing it correctly or not.
>>
>> From my understanding, CMA required an alignment of
>> max(MAX_ORDER - 1, pageblock_order) because, when MIGRATE_CMA was introduced,
>> __free_one_page() did not prevent merging two different pageblocks when
>> MAX_ORDER - 1 > pageblock_order. But the current __free_one_page() implementation
>> does prevent that. It should be OK to just align CMA to pageblock_order.
>> alloc_contig_range() relies on MIGRATE_CMA to get free pages, so it can use
>> pageblock_order as its alignment too.
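To make the alignment change concrete, here is a small illustration assuming the common defaults MAX_ORDER = 11 and pageblock_order = 9; these values are assumptions for the example, not taken from the thread.

```c
#include <assert.h>

/* Assumed typical values, for illustration only. */
#define MAX_ORDER       11
#define PAGEBLOCK_ORDER  9

/* Old CMA alignment: max(MAX_ORDER - 1, pageblock_order), in pages. */
static unsigned long old_cma_align_pages(void)
{
    int order = (MAX_ORDER - 1 > PAGEBLOCK_ORDER) ? MAX_ORDER - 1
                                                  : PAGEBLOCK_ORDER;
    return 1UL << order;
}

/* Proposed CMA alignment: just pageblock_order, in pages. */
static unsigned long new_cma_align_pages(void)
{
    return 1UL << PAGEBLOCK_ORDER;
}
```

With these assumed orders, the required alignment drops from 1024 pages (order 10) to 512 pages (one pageblock).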
>
> I wonder if that's sufficient. Especially the outer_start logic in
> alloc_contig_range() might be problematic. There are some ugly corner
> cases with free pages/allocations spanning multiple pageblocks and we
> only isolated a single pageblock.

Thank you a lot for writing the list of these corner cases. They are
very helpful!

>
>
> Regarding CMA, we have to keep the following cases working:
>
> a) Different pageblock types (MIGRATE_CMA and !MIGRATE_CMA) in MAX_ORDER
>- 1 page:
>[   MAX_ORDER - 1 ]
>[ pageblock 0 | pageblock 1]
>
> Assume either pageblock 0 is MIGRATE_CMA or pageblock 1 is MIGRATE_CMA,
> but not both. We have to make sure alloc_contig_range() keeps working
> correctly. This should be the case even with your change, as we won't be
> merging pages across differing migratetypes.

Yes.

>
> b) Migrating/freeing a MAX_ORDER - 1 page while partially isolated:
>[   MAX_ORDER - 1 ]
>[ pageblock 0 | pageblock 1]
>
> Assume both are MIGRATE_CMA. Assume we want to either allocate from
> pageblock 0 or pageblock 1. Especially, assume we want to allocate from
> pageblock 1. While we would isolate pageblock 1, we wouldn't isolate
> pageblock 0.
>
> What happens if we either have a free page spanning the MAX_ORDER - 1
> range already OR if we have to migrate a MAX_ORDER - 1 page, resulting
> in a free MAX_ORDER - 1 page of which only the second pageblock is
> isolated? We would end up essentially freeing a page that has mixed
> pageblocks, essentially placing it in !MIGRATE_ISOLATE free lists ... I
> might be wrong but I have the feeling that this would be problematic.
>

This could happen when start_isolate_page_range() stumbles upon a compound
page with order >= pageblock_order or a free page with order >=
pageblock_order, but it should not. start_isolate_page_range() should check
the actual page size, either the compound page size or the free page size, and
set the migratetype across pageblocks if the page is bigger than pageblock
size. More precisely, set_migratetype_isolate() should do that.


> c) Concurrent allocations:
> [   MAX_ORDER - 1 ]
> [ pageblock 0 | pageblock 1]
>
> Assume b) but we have two concurrent CMA allocations to pageblock 0 and
> pageblock 1, which would now be possible as start_isolate_page_range()
> isolate would succeed on both.

Two isolations will be serialized by the zone lock taken by
set_migratetype_isolate(), so the concurrent allocation would not be a problem.
If it is a MAX_ORDER-1 free page, the first comer should split it and only
isolate one of the pageblocks; then the second one can isolate the other pageblock.
If it is a MAX_ORDER-1 compound page, the first comer should isolate both
pageblocks, then the second one would fail. WDYT?


In sum, it seems to me that the issue is that the page isolation code only
sees the pageblock without checking the actual page. When multiple pageblocks
belong to one page, the problem appears. This should be fixed.

>
>
> Regarding virtio-mem, we care about the following cases:
>
> a) Allocating parts from a completely movable MAX_ORDER - 1 page:
>[   MAX_ORDER - 1 ]
>[ pageblock 0 | pageblock 1]
>
> Assume pageblock 0 and pageblock 1 are either free or contain only
> movable pages. Assume we allocated pageblock 0. We have to make sure we
> can allocate pageblock 1. The other way around, assume we allocated
> pageblock 1, we have to make sure we can allocate pageblock 0.
>
> Free pages spanning both pageblocks might be problematic.

Can you elaborate a bit? If either of pageblocks 0 and 1 is used by
virtio-mem, why do we care about the other? If pageblocks 0 and 1 belong to
the same page (either free or compound), they should have the same
migratetype. If we want to allocate just one of them, we can split
the free page, or migrate the compound page and then split the remaining
free page.

>
> b) Allocate parts of a partially movable MAX_ORDER - 1 page:
>[   MAX_ORDER - 1 ]
>[ pageblock 0 | pageblock 1]
>
> Assume pageblock 0 contains unmovable data but 

Re: [PATCH] powerpc/pseries: delete scanlog

2021-11-16 Thread Nathan Lynch
Nathan Lynch  writes:
> Remove the pseries scanlog driver.
>
> This code supports functions from Power4-era servers that are not present
> on targets currently supported by arch/powerpc. System manuals from this
> time have this description:
>
>   Scan Dump data is a set of chip data that the service processor gathers
>   after a system malfunction. It consists of chip scan rings, chip trace
>   arrays, and Scan COM (SCOM) registers. This data is stored in the
>   scan-log partition of the system’s Nonvolatile Random Access
>   Memory (NVRAM).
>
> PowerVM partition firmware development doesn't recognize the associated
> function call or property, and they don't see any references to them in
> their codebase. It seems to have been specific to non-virtualized
> pseries.

Just bumping this to see if there are any objections.


Re: Build regressions/improvements in v5.16-rc1

2021-11-16 Thread Randy Dunlap

On 11/16/21 5:59 PM, Nick Terrell wrote:




On Nov 15, 2021, at 8:44 AM, Helge Deller  wrote:

On 11/15/21 17:12, Geert Uytterhoeven wrote:

On Mon, Nov 15, 2021 at 4:54 PM Geert Uytterhoeven  wrote:

Below is the list of build error/warning regressions/improvements in
v5.16-rc1[1] compared to v5.15[2].

Summarized:
  - build errors: +20/-13
  - build warnings: +3/-28

Happy fixing! ;-)

Thanks to the linux-next team for providing the build service.

[1] 
http://kisskb.ellerman.id.au/kisskb/branch/linus/head/fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf/
  (all 90 configs)
[2] 
http://kisskb.ellerman.id.au/kisskb/branch/linus/head/8bb7eca972ad531c9b149c0a51ab43a417385813/
  (all 90 configs)


*** ERRORS ***

20 error regressions:
  + /kisskb/src/arch/parisc/include/asm/jump_label.h: error: expected ':' before 
'__stringify':  => 33:4, 18:4
  + /kisskb/src/arch/parisc/include/asm/jump_label.h: error: label 'l_yes' defined 
but not used [-Werror=unused-label]:  => 38:1, 23:1


due to static_branch_likely() in crypto/api.c

parisc-allmodconfig


fixed now in the parisc for-next git tree.



  + /kisskb/src/drivers/gpu/drm/msm/msm_drv.h: error: "COND" redefined [-Werror]:  
=> 531
  + /kisskb/src/lib/zstd/compress/zstd_double_fast.c: error: the frame size of 
3252 bytes is larger than 1536 bytes [-Werror=frame-larger-than=]:  => 47:1
  + /kisskb/src/lib/zstd/compress/zstd_double_fast.c: error: the frame size of 
3360 bytes is larger than 1536 bytes [-Werror=frame-larger-than=]:  => 499:1
  + /kisskb/src/lib/zstd/compress/zstd_double_fast.c: error: the frame size of 
5344 bytes is larger than 1536 bytes [-Werror=frame-larger-than=]:  => 334:1
  + /kisskb/src/lib/zstd/compress/zstd_double_fast.c: error: the frame size of 
5380 bytes is larger than 1536 bytes [-Werror=frame-larger-than=]:  => 354:1
  + /kisskb/src/lib/zstd/compress/zstd_fast.c: error: the frame size of 1824 bytes 
is larger than 1536 bytes [-Werror=frame-larger-than=]:  => 372:1
  + /kisskb/src/lib/zstd/compress/zstd_fast.c: error: the frame size of 2224 bytes 
is larger than 1536 bytes [-Werror=frame-larger-than=]:  => 204:1
  + /kisskb/src/lib/zstd/compress/zstd_fast.c: error: the frame size of 3800 bytes 
is larger than 1536 bytes [-Werror=frame-larger-than=]:  => 476:1


parisc-allmodconfig


parisc needs much bigger frame sizes, so I'm not astonished here.
During the v5.15 cycle I increased it to 1536 (from 1280), so I'm simply
tempted to increase it this time to 4096, unless someone has a better idea.


This patch set should fix the zstd stack size warnings [0]. I’ve
verified the fix using the same tooling: gcc-8-hppa-linux-gnu.

I’ll send the PR to Linus tomorrow. I’ve been informed that it
isn't strictly necessary to send the patches to the mailing list
for bug fixes, but it’s already done, so I’ll wait and see if there
is any feedback.


IMO several (or many more) people would disagree with that.

"strictly?"  OK, it's probably possible that almost any patch
could be merged without being on a mailing list, but it's not
desirable (except in the case of "security" patches).

--
~Randy


[powerpc:fixes-test] BUILD SUCCESS 1e35eba4055149c578baf0318d2f2f89ea3c44a0

2021-11-16 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
fixes-test
branch HEAD: 1e35eba4055149c578baf0318d2f2f89ea3c44a0  powerpc/8xx: Fix pinned 
TLBs with CONFIG_STRICT_KERNEL_RWX

elapsed time: 724m

configs tested: 190
configs skipped: 104

The following configs have been built successfully.
More configs may be tested in the coming days.

gcc tested configs:
arm defconfig
arm64allyesconfig
arm64   defconfig
arm  allyesconfig
arm  allmodconfig
i386 randconfig-c001-2026
mips randconfig-c004-2026
powerpc mpc836x_rdk_defconfig
arm pxa_defconfig
powerpc mpc8313_rdb_defconfig
sh  r7785rp_defconfig
s390 allmodconfig
microblaze  mmu_defconfig
mips tb0287_defconfig
sh   sh7724_generic_defconfig
m68k   m5249evb_defconfig
x86_64  defconfig
armcerfcube_defconfig
powerpc   eiger_defconfig
mips loongson2k_defconfig
powerpcsocrates_defconfig
xtensa   common_defconfig
mips  decstation_64_defconfig
m68kq40_defconfig
archsdk_defconfig
arm s5pv210_defconfig
xtensa  iss_defconfig
powerpc mpc837x_rdb_defconfig
powerpc ps3_defconfig
sh  rts7751r2d1_defconfig
shtitan_defconfig
um i386_defconfig
ia64 allyesconfig
shhp6xx_defconfig
powerpc  ppc44x_defconfig
powerpc mpc85xx_cds_defconfig
s390   zfcpdump_defconfig
armmini2440_defconfig
microblaze  defconfig
sh  landisk_defconfig
arm at91_dt_defconfig
m68km5407c3_defconfig
openriscdefconfig
mips decstation_r4k_defconfig
powerpc pq2fads_defconfig
powerpc  walnut_defconfig
arm mxs_defconfig
mips  ath79_defconfig
powerpc  makalu_defconfig
mips   sb1250_swarm_defconfig
armclps711x_defconfig
arm socfpga_defconfig
powerpc   mpc834x_itxgp_defconfig
mips   rs90_defconfig
arm  exynos_defconfig
powerpccell_defconfig
arm axm55xx_defconfig
arm   sama5_defconfig
mips  rm200_defconfig
arm   stm32_defconfig
arm   spear13xx_defconfig
ia64 alldefconfig
mips  malta_defconfig
arm   versatile_defconfig
arm  gemini_defconfig
arm eseries_pxa_defconfig
arm   u8500_defconfig
mips cobalt_defconfig
sh   secureedge5410_defconfig
powerpc  bamboo_defconfig
h8300   defconfig
arm   aspeed_g4_defconfig
arm   imx_v6_v7_defconfig
powerpc mpc832x_mds_defconfig
s390 alldefconfig
arm   imx_v4_v5_defconfig
sh apsh4a3a_defconfig
armhisi_defconfig
mips  rb532_defconfig
ia64defconfig
armmagician_defconfig
powerpcmpc7448_hpc2_defconfig
arm  collie_defconfig
powerpc  ep88xc_defconfig
arc   tb10x_defconfig
armdove_defconfig
nios2 3c120_defconfig
arm   mainstone_defconfig
powerpc wii_defconfig
powerpc mpc512x_defconfig
sh kfr2r09-romimage_defconfig
sh  rsk7264_defconfig
m68k apollo_defconfig
shshmin_defconfig
h8300alldefconfig
mips rt305x_defconfig
armmps2_defconfig
xtensa  cadence_csp_defconfig
sh ap325rxa_defconfig
m68k amcore_defconfig
arm

[powerpc:fixes] BUILD SUCCESS 302039466f6a3b9421ecb9a6a2c528801dc24a86

2021-11-16 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
fixes
branch HEAD: 302039466f6a3b9421ecb9a6a2c528801dc24a86  powerpc/pseries: Fix 
numa FORM2 parsing fallback code

elapsed time: 725m

configs tested: 153
configs skipped: 108

The following configs have been built successfully.
More configs may be tested in the coming days.

gcc tested configs:
arm defconfig
arm64allyesconfig
arm64   defconfig
arm  allyesconfig
arm  allmodconfig
i386 randconfig-c001-2026
mips randconfig-c004-2026
powerpc mpc836x_rdk_defconfig
arm pxa_defconfig
powerpc mpc8313_rdb_defconfig
sh  r7785rp_defconfig
s390 allmodconfig
microblaze  mmu_defconfig
mips tb0287_defconfig
sh   sh7724_generic_defconfig
m68k   m5249evb_defconfig
armcerfcube_defconfig
powerpc   eiger_defconfig
m68kq40_defconfig
archsdk_defconfig
arm s5pv210_defconfig
xtensa  iss_defconfig
powerpc mpc837x_rdb_defconfig
powerpc ps3_defconfig
sh  rts7751r2d1_defconfig
shtitan_defconfig
um i386_defconfig
ia64 allyesconfig
s390   zfcpdump_defconfig
armmini2440_defconfig
microblaze  defconfig
sh  landisk_defconfig
arm at91_dt_defconfig
m68km5407c3_defconfig
openriscdefconfig
mips decstation_r4k_defconfig
arm axm55xx_defconfig
arm   sama5_defconfig
mips  rm200_defconfig
arm   stm32_defconfig
arm   spear13xx_defconfig
ia64 alldefconfig
powerpc   mpc834x_itxgp_defconfig
mips  malta_defconfig
arm   versatile_defconfig
arm   imx_v4_v5_defconfig
sh apsh4a3a_defconfig
armhisi_defconfig
mips  rb532_defconfig
ia64defconfig
armmagician_defconfig
powerpcmpc7448_hpc2_defconfig
mips  ath79_defconfig
arm  collie_defconfig
nios2 3c120_defconfig
arm   mainstone_defconfig
powerpc wii_defconfig
powerpc mpc512x_defconfig
powerpc mpc85xx_cds_defconfig
sh kfr2r09-romimage_defconfig
sh  rsk7264_defconfig
m68k apollo_defconfig
shshmin_defconfig
h8300alldefconfig
mips rt305x_defconfig
powerpc mpc832x_mds_defconfig
armmps2_defconfig
xtensa  cadence_csp_defconfig
sh ap325rxa_defconfig
m68k amcore_defconfig
armmvebu_v7_defconfig
shdreamcast_defconfig
armvt8500_v6_v7_defconfig
arcnsim_700_defconfig
arm   milbeaut_m10v_defconfig
sh   sh7770_generic_defconfig
xtensa virt_defconfig
ia64  tiger_defconfig
arm  randconfig-c002-2026
ia64 allmodconfig
m68k allmodconfig
m68kdefconfig
m68k allyesconfig
nios2   defconfig
arc  allyesconfig
nds32 allnoconfig
nds32   defconfig
nios2allyesconfig
cskydefconfig
alpha   defconfig
alphaallyesconfig
xtensa   allyesconfig
h8300allyesconfig
arc defconfig
sh   allmodconfig
parisc  defconfig
s390 allyesconfig
parisc   allyesconfig
s390defconfig
i386 allyesconfig
sparc

[powerpc:next-test] BUILD SUCCESS 098a78a0f7f776200f1c5694983a37b77c522dbb

2021-11-16 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next-test
branch HEAD: 098a78a0f7f776200f1c5694983a37b77c522dbb  powerpc/code-patching: Improve verification of patchability

elapsed time: 721m

configs tested: 195
configs skipped: 4

The following configs have been built successfully.
More configs may be tested in the coming days.

gcc tested configs:
arm defconfig
arm64allyesconfig
arm64   defconfig
arm  allyesconfig
arm  allmodconfig
i386 randconfig-c001-2026
mips randconfig-c004-2026
powerpc mpc836x_rdk_defconfig
arm pxa_defconfig
powerpc mpc8313_rdb_defconfig
sh  r7785rp_defconfig
s390 allmodconfig
microblaze  mmu_defconfig
mips tb0287_defconfig
sh   sh7724_generic_defconfig
m68k   m5249evb_defconfig
x86_64  defconfig
armcerfcube_defconfig
powerpc   eiger_defconfig
mips loongson2k_defconfig
powerpcsocrates_defconfig
xtensa   common_defconfig
mips  decstation_64_defconfig
m68kq40_defconfig
archsdk_defconfig
arm s5pv210_defconfig
xtensa  iss_defconfig
powerpc mpc837x_rdb_defconfig
powerpc ps3_defconfig
sh  rts7751r2d1_defconfig
shtitan_defconfig
um i386_defconfig
ia64 allyesconfig
shhp6xx_defconfig
powerpc  ppc44x_defconfig
powerpc mpc85xx_cds_defconfig
s390   zfcpdump_defconfig
armmini2440_defconfig
microblaze  defconfig
mips  malta_kvm_defconfig
sparc64  alldefconfig
mips  ath25_defconfig
arm  simpad_defconfig
sh  landisk_defconfig
arm at91_dt_defconfig
m68km5407c3_defconfig
openriscdefconfig
mips decstation_r4k_defconfig
powerpc pq2fads_defconfig
powerpc  walnut_defconfig
arm mxs_defconfig
mips  ath79_defconfig
powerpc  makalu_defconfig
mips   sb1250_swarm_defconfig
armclps711x_defconfig
arm socfpga_defconfig
powerpc   mpc834x_itxgp_defconfig
mips   rs90_defconfig
arm  exynos_defconfig
powerpccell_defconfig
arm axm55xx_defconfig
arm   sama5_defconfig
mips  rm200_defconfig
arm   stm32_defconfig
arm   spear13xx_defconfig
ia64 alldefconfig
mips  malta_defconfig
arm   versatile_defconfig
sh   secureedge5410_defconfig
powerpc  bamboo_defconfig
h8300   defconfig
arm   aspeed_g4_defconfig
arm   imx_v6_v7_defconfig
powerpc mpc832x_mds_defconfig
s390 alldefconfig
arm   imx_v4_v5_defconfig
sh apsh4a3a_defconfig
armhisi_defconfig
mips  rb532_defconfig
arm hackkit_defconfig
armqcom_defconfig
sh  r7780mp_defconfig
shmigor_defconfig
arm orion5x_defconfig
ia64defconfig
armmagician_defconfig
powerpcmpc7448_hpc2_defconfig
arm  collie_defconfig
powerpc  ep88xc_defconfig
arc   tb10x_defconfig
armdove_defconfig
nios2 3c120_defconfig
arm   mainstone_defconfig
powerpc wii_defconfig
powerpc mpc512x_defconfig
sh kfr2r09-romimage_defconfig
sh  rsk7264_defconfig
m68k apollo_defconfig
shshmin_defconfig
h8300alldefconfig
mips

[powerpc:merge] BUILD SUCCESS 64044d20a2d777d2346cde7eb41d17497c4f4449

2021-11-16 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git merge
branch HEAD: 64044d20a2d777d2346cde7eb41d17497c4f4449  Automatic merge of 'fixes' into merge (2021-11-16 22:46)

elapsed time: 721m

configs tested: 140
configs skipped: 3

The following configs have been built successfully.
More configs may be tested in the coming days.

gcc tested configs:
arm defconfig
arm64allyesconfig
arm64   defconfig
arm  allyesconfig
arm  allmodconfig
i386 randconfig-c001-2026
mips tb0287_defconfig
sh   sh7724_generic_defconfig
m68k   m5249evb_defconfig
x86_64  defconfig
armcerfcube_defconfig
powerpc   eiger_defconfig
mips cu1830-neo_defconfig
powerpc  pcm030_defconfig
arm   sama5_defconfig
sparc64  alldefconfig
s390   zfcpdump_defconfig
armmini2440_defconfig
microblaze  defconfig
mips  cavium_octeon_defconfig
powerpcadder875_defconfig
m68k amcore_defconfig
powerpc  ppc40x_defconfig
arcvdk_hs38_defconfig
armmvebu_v5_defconfig
arm mv78xx0_defconfig
sh   se7619_defconfig
xtensageneric_kc705_defconfig
powerpc64   defconfig
arm axm55xx_defconfig
mips  rm200_defconfig
arm   stm32_defconfig
arm   spear13xx_defconfig
ia64 alldefconfig
powerpc ps3_defconfig
powerpc mpc836x_rdk_defconfig
arm  gemini_defconfig
arm eseries_pxa_defconfig
arm   u8500_defconfig
mips cobalt_defconfig
armlart_defconfig
powerpc pseries_defconfig
shhp6xx_defconfig
mips  maltaaprp_defconfig
powerpc xes_mpc85xx_defconfig
arcvdk_hs38_smp_defconfig
shapsh4ad0a_defconfig
s390 alldefconfig
s390  debug_defconfig
sh kfr2r09-romimage_defconfig
sh  rsk7264_defconfig
m68k apollo_defconfig
shshmin_defconfig
h8300alldefconfig
mips rt305x_defconfig
powerpc mpc832x_mds_defconfig
armmps2_defconfig
xtensa  cadence_csp_defconfig
riscv allnoconfig
armvt8500_v6_v7_defconfig
arcnsim_700_defconfig
arm   milbeaut_m10v_defconfig
sh   sh7770_generic_defconfig
powerpc mpc85xx_cds_defconfig
arm  randconfig-c002-2026
ia64 allmodconfig
ia64defconfig
ia64 allyesconfig
m68k allmodconfig
m68kdefconfig
m68k allyesconfig
nios2   defconfig
arc  allyesconfig
nds32 allnoconfig
nds32   defconfig
nios2allyesconfig
cskydefconfig
alpha   defconfig
alphaallyesconfig
xtensa   allyesconfig
h8300allyesconfig
arc defconfig
sh   allmodconfig
parisc  defconfig
s390 allyesconfig
s390 allmodconfig
parisc   allyesconfig
s390defconfig
i386 allyesconfig
sparcallyesconfig
sparc   defconfig
i386defconfig
i386  debian-10.3
mips allyesconfig
mips allmodconfig
powerpc  allyesconfig
powerpc  allmodconfig
powerpc   allnoconfig
x86_64   randconfig-a015-2026
x86_64   randconfig-a013-2026
x86_64

[PATCH] powerpc/rtas: kernel-doc fixes

2021-11-16 Thread Nathan Lynch
Fix the following issues reported by kernel-doc:

$ scripts/kernel-doc -v -none arch/powerpc/kernel/rtas.c
arch/powerpc/kernel/rtas.c:810: info: Scanning doc for function rtas_activate_firmware
arch/powerpc/kernel/rtas.c:818: warning: contents before sections
arch/powerpc/kernel/rtas.c:841: info: Scanning doc for function rtas_call_reentrant
arch/powerpc/kernel/rtas.c:893: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
 * Find a specific pseries error log in an RTAS extended event log.

Signed-off-by: Nathan Lynch 
---
 arch/powerpc/kernel/rtas.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index ff80bbad22a5..ca27421f471a 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -809,13 +809,13 @@ void rtas_os_term(char *str)
 /**
  * rtas_activate_firmware() - Activate a new version of firmware.
  *
+ * Context: This function may sleep.
+ *
  * Activate a new version of partition firmware. The OS must call this
  * after resuming from a partition hibernation or migration in order
  * to maintain the ability to perform live firmware updates. It's not
  * catastrophic for this method to be absent or to fail; just log the
  * condition in that case.
- *
- * Context: This function may sleep.
  */
 void rtas_activate_firmware(void)
 {
@@ -890,11 +890,12 @@ int rtas_call_reentrant(int token, int nargs, int nret, int *outputs, ...)
 #endif /* CONFIG_PPC_PSERIES */
 
 /**
- * Find a specific pseries error log in an RTAS extended event log.
+ * get_pseries_errorlog() - Find a specific pseries error log in an RTAS
+ *  extended event log.
  * @log: RTAS error/event log
  * @section_id: two character section identifier
  *
- * Returns a pointer to the specified errorlog or NULL if not found.
+ * Return: A pointer to the specified errorlog or NULL if not found.
  */
 struct pseries_errorlog *get_pseries_errorlog(struct rtas_error_log *log,
  uint16_t section_id)
-- 
2.31.1



RE: bug: usb: gadget: FSL_UDC_CORE Corrupted request list leads to unrecoverable loop.

2021-11-16 Thread Eugene Bordenkircher
On 02.11.21 22:15, Joakim Tjernlund wrote:
> On Sat, 2021-10-30 at 14:20 +, Joakim Tjernlund wrote:
>> On Fri, 2021-10-29 at 17:14 +, Eugene Bordenkircher wrote:
>
>>> We've discovered a situation where the FSL udc driver 
>>> (drivers/usb/gadget/udc/fsl_udc_core.c) will enter a loop iterating over 
>>> the request queue, but the queue has been corrupted at some point so it 
>>> loops infinitely.  I believe we have narrowed into the offending code, but 
>>> we are in need of assistance trying to find an appropriate fix for the 
>>> problem.  The identified code appears to be in all versions of the Linux 
>>> kernel the driver exists in.
>>>
>>> The problem appears to be when handling a USB_REQ_GET_STATUS request.  The 
>>> driver gets this request and then calls the ch9getstatus() function.  In 
>>> this function, it starts a request by "borrowing" the per device 
>>> status_req, filling it in, and then queuing it with a call to 
>>> list_add_tail() to add the request to the endpoint queue.  Right before it 
>>> exits the function however, it's calling ep0_prime_status(), which is 
>>> filling out that same status_req structure and then queuing it with another 
>>> call to list_add_tail() to add the request to the endpoint queue.  This 
>>> adds two instances of the exact same LIST_HEAD to the endpoint queue, which 
>>> breaks the list since the prev and next pointers end up pointing to the 
>>> wrong things.  This ends up causing a hard loop the next time nuke() gets 
>>> called, which happens on the next setup IRQ.
>>>
>>> I'm not sure what the appropriate fix to this problem is, mostly due to my 
>>> lack of expertise in USB and this driver stack.  The code has been this way 
>>> in the kernel for a very long time, which suggests that it has been 
>>> working, unless USB_REQ_GET_STATUS requests are never made.  This further 
>>> suggests that there is something else going on that I don't understand.  
>>> Deleting the call to ep0_prime_status() and the following ep0stall() call 
>>> appears, on the surface, to get the device working again, but may have side 
>>> effects that I'm not seeing.
>>>
>>> I'm hopeful someone in the community can help provide some information on 
>>> what I may be missing or help come up with a solution to the problem.  A 
>>> big thank you to anyone who would like to help out.
>>>
>>> Eugene
>>
>> Run into this to a while ago. Found the bug and a few more fixes.
>> This is against 4.19 so you may have to tweak them a bit.
>> Feel free to upstream them.
>>
>>  Jocke
>
> Curious, did my patches help? Good to known once we upgrade as well.
>
>  Jocke

There's good news and bad news.

The good news is that this appears to stop the driver from entering an infinite 
loop, which prevents the Linux system from locking up and never recovering.  So 
I'm willing to say we've made the behavior better.

The bad news is that once we get past this point, there is new bad behavior.  
What is on top of this driver in our system is the RNDIS gadget driver 
communicating to a Laptop running Win10 -1809.  Everything appears to work fine 
with the Linux system until there is a USB disconnect.  After the disconnect, 
the Linux side appears to continue on just fine, but the Windows side doesn't 
seem to recognize the disconnect, which causes the USB driver on that side to 
hang forever and eventually blue screen the box.  This doesn't happen on all 
machines, just a select few.   I think we can isolate the behavior to a 
specific antivirus/security software driver that is inserting itself into the 
USB stack and filtering the disconnect message, but we're still proving that.

I'm about 90% certain this is a different problem and we can call this patchset 
good, at least for our test setup.  My only hesitation is if the Linux side is 
sending a set of responses that are confusing the Windows side (specifically 
this antivirus) or not.  I'd be content calling that a separate defect though 
and letting this one close up with that patchset.

Eugene


Re: [PATCH 6/7] KVM: powerpc: Use Makefile.kvm for common files

2021-11-16 Thread David Woodhouse
On Tue, 2021-11-16 at 18:43 +, Sean Christopherson wrote:
> On Tue, Nov 16, 2021, David Woodhouse wrote:
> > From: David Woodhouse 
> > 
> > It's all fairly baroque but in the end, I don't think there's any reason
> > for $(KVM)/irqchip.o to have been handled differently, as they all end
> > up in $(kvm-y) in the end anyway, regardless of whether they get there
> > via $(common-objs-y) and the CPU-specific object lists.
> > 
> > The generic Makefile.kvm uses HAVE_KVM_IRQCHIP for irqchip.o instead of
> > HAVE_KVM_IRQ_ROUTING. That change is fine (and arguably correct) because
> > they are both set together for KVM_MPIC, or neither is set.
> 
> Nope.
> 
>   Symbol: HAVE_KVM_IRQCHIP [=y]
>   Type  : bool
>   Defined at virt/kvm/Kconfig:7
>   Selected by [m]:
> - KVM_XICS [=y] && VIRTUALIZATION [=y] && KVM_BOOK3S_64 [=m] && !KVM_MPIC [=n]
>   Selected by [n]:
> - KVM_MPIC [=n] && VIRTUALIZATION [=y] && KVM [=y] && E500 [=n]
> 
> leads to this and a whole pile of other errors
> 
> arch/powerpc/kvm/../../../virt/kvm/irqchip.c: In function ‘kvm_irq_map_gsi’:
> arch/powerpc/kvm/../../../virt/kvm/irqchip.c:31:35: error: invalid use of undefined type ‘struct kvm_irq_routing_table’
>31 | if (irq_rt && gsi < irq_rt->nr_rt_entries) {
>   |   ^~
> 

Hm, perhaps it should have been like this then (incremental):

+++ b/virt/kvm/Makefile.kvm
@@ -9,6 +9,6 @@ kvm-y := $(KVM)/kvm_main.o $(KVM)/eventfd.o $(KVM)/binary_stats.o
 kvm-$(CONFIG_KVM_VFIO) += $(KVM)/vfio.o
 kvm-$(CONFIG_KVM_MMIO) += $(KVM)/coalesced_mmio.o
 kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
-kvm-$(CONFIG_HAVE_KVM_IRQCHIP) += $(KVM)/irqchip.o
+kvm-$(CONFIG_HAVE_KVM_IRQ_ROUTING) += $(KVM)/irqchip.o
 kvm-$(CONFIG_HAVE_KVM_DIRTY_RING) += $(KVM)/dirty_ring.o
 kvm-$(CONFIG_HAVE_KVM_PFNCACHE) += $(KVM)/pfncache.o


> Side topic, please don't post a new version/series in-reply-to a different series.
> b4 also gets confused in this case, e.g. it tried to grab the original patch.  b4
> has also made me really lazy, heaven forbid I actually had to manually grab these
> from mutt :-)

Sorry ;)

I think that one might even be a new series in reply to what was
already a second series on top of what I was *actually* trying to do
when I first started shaving this yak. Or maybe what I was originally
trying to implement has already been lost in the noise :)




Re: [PATCH] powerpc/xive: Change IRQ domain to a tree domain

2021-11-16 Thread Cédric Le Goater

On 11/16/21 17:58, Marc Zyngier wrote:

On Tue, 16 Nov 2021 13:40:22 +,
Cédric Le Goater  wrote:


Commit 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains
exclusive") introduced an IRQ_DOMAIN_FLAG_NO_MAP flag to isolate the
'nomap' domains still in use under the powerpc arch. With this new
flag, the revmap_tree of the IRQ domain is not used anymore. This
change broke the support of shared LSIs [1] in the XIVE driver because
it was relying on a lookup in the revmap_tree to query previously
mapped interrupts.


Just a lookup? Surely there is more to it, no?


nope. The HW IRQ for the INTx is defined in the DT. It is caught by
of_irq_parse_and_map_pci() which simply adds an extra mapping on the
same INTx since the previous one is not found.

Using an INTx is quite rare nowadays, and a shared one is even more
uncommon I guess. I could only reproduce it on the bare-metal platform
with the QEMU PowerNV machine using the same virtio devices.

Thanks,

C.
 



Linux now creates two distinct IRQ mappings on the
same HW IRQ which can lead to unexpected behavior in the drivers.

The XIVE IRQ domain is not a direct mapping domain and its HW IRQ
interrupt number space is rather large : 1M/socket on POWER9 and
POWER10, change the XIVE driver to use a 'tree' domain type instead.

[1] For instance, a linux KVM guest with virtio-rng and virtio-balloon
 devices.

Cc: Marc Zyngier 
Cc: sta...@vger.kernel.org # v5.14+
Fixes: 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains exclusive")
Signed-off-by: Cédric Le Goater 
---

  Marc,

  The Fixes tag is there because the patch in question revealed that
  something was broken in XIVE. genirq is not in cause. However, I
  don't know for PS3 and Cell. May be less critical for now.


Depends if they expect something that a no-map domain cannot provide.
  
  arch/powerpc/sysdev/xive/common.c | 3 +--

  arch/powerpc/sysdev/xive/Kconfig  | 1 -
  2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index fed6fd16c8f4..9d0f0fe25598 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1536,8 +1536,7 @@ static const struct irq_domain_ops xive_irq_domain_ops = {
  
  static void __init xive_init_host(struct device_node *np)

  {
-   xive_irq_domain = irq_domain_add_nomap(np, XIVE_MAX_IRQ,
-  &xive_irq_domain_ops, NULL);
+   xive_irq_domain = irq_domain_add_tree(np, &xive_irq_domain_ops, NULL);
if (WARN_ON(xive_irq_domain == NULL))
return;
irq_set_default_host(xive_irq_domain);
diff --git a/arch/powerpc/sysdev/xive/Kconfig b/arch/powerpc/sysdev/xive/Kconfig
index 97796c6b63f0..785c292d104b 100644
--- a/arch/powerpc/sysdev/xive/Kconfig
+++ b/arch/powerpc/sysdev/xive/Kconfig
@@ -3,7 +3,6 @@ config PPC_XIVE
bool
select PPC_SMP_MUXED_IPI
select HARDIRQS_SW_RESEND
-   select IRQ_DOMAIN_NOMAP
  
  config PPC_XIVE_NATIVE

bool


As long as this works, I'm happy with one less no-map user.

Acked-by: Marc Zyngier 

M.





Re: [PATCH] powerpc/xive: Change IRQ domain to a tree domain

2021-11-16 Thread Marc Zyngier
On Tue, 16 Nov 2021 13:40:22 +,
Cédric Le Goater  wrote:
> 
> Commit 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains
> exclusive") introduced an IRQ_DOMAIN_FLAG_NO_MAP flag to isolate the
> 'nomap' domains still in use under the powerpc arch. With this new
> flag, the revmap_tree of the IRQ domain is not used anymore. This
> change broke the support of shared LSIs [1] in the XIVE driver because
> it was relying on a lookup in the revmap_tree to query previously
> mapped interrupts.

Just a lookup? Surely there is more to it, no?

> Linux now creates two distinct IRQ mappings on the
> same HW IRQ which can lead to unexpected behavior in the drivers.
> 
> The XIVE IRQ domain is not a direct mapping domain and its HW IRQ
> interrupt number space is rather large : 1M/socket on POWER9 and
> POWER10, change the XIVE driver to use a 'tree' domain type instead.
> 
> [1] For instance, a linux KVM guest with virtio-rng and virtio-balloon
> devices.
> 
> Cc: Marc Zyngier 
> Cc: sta...@vger.kernel.org # v5.14+
> Fixes: 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains exclusive")
> Signed-off-by: Cédric Le Goater 
> ---
> 
>  Marc,
> 
>  The Fixes tag is there because the patch in question revealed that
>  something was broken in XIVE. genirq is not in cause. However, I
>  don't know for PS3 and Cell. May be less critical for now.

Depends if they expect something that a no-map domain cannot provide.

>  
>  arch/powerpc/sysdev/xive/common.c | 3 +--
>  arch/powerpc/sysdev/xive/Kconfig  | 1 -
>  2 files changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
> index fed6fd16c8f4..9d0f0fe25598 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -1536,8 +1536,7 @@ static const struct irq_domain_ops xive_irq_domain_ops = {
>  
>  static void __init xive_init_host(struct device_node *np)
>  {
> - xive_irq_domain = irq_domain_add_nomap(np, XIVE_MAX_IRQ,
> -   &xive_irq_domain_ops, NULL);
> + xive_irq_domain = irq_domain_add_tree(np, &xive_irq_domain_ops, NULL);
>   if (WARN_ON(xive_irq_domain == NULL))
>   return;
>   irq_set_default_host(xive_irq_domain);
> diff --git a/arch/powerpc/sysdev/xive/Kconfig b/arch/powerpc/sysdev/xive/Kconfig
> index 97796c6b63f0..785c292d104b 100644
> --- a/arch/powerpc/sysdev/xive/Kconfig
> +++ b/arch/powerpc/sysdev/xive/Kconfig
> @@ -3,7 +3,6 @@ config PPC_XIVE
>   bool
>   select PPC_SMP_MUXED_IPI
>   select HARDIRQS_SW_RESEND
> - select IRQ_DOMAIN_NOMAP
>  
>  config PPC_XIVE_NATIVE
>   bool

As long as this works, I'm happy with one less no-map user.

Acked-by: Marc Zyngier 

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH 4/5] KVM: x86: Use kvm_get_vcpu() instead of open-coded access

2021-11-16 Thread Paolo Bonzini

On 11/16/21 17:07, Sean Christopherson wrote:

  if (!kvm_arch_has_assigned_device(kvm) ||
  !irq_remapping_cap(IRQ_POSTING_CAP) ||
-   !kvm_vcpu_apicv_active(kvm->vcpus[0]))
+   !irqchip_in_kernel(kvm) || !enable_apicv)
  return 0;

	idx = srcu_read_lock(&kvm->irq_srcu);

What happens then if pi_pre_block is called and the IRTE denotes a posted
interrupt?

I might be wrong, but it seems to me that you have to change all of the
occurrences this way.  As soon as enable_apicv is set, you need to go
through the POSTED_INTR_WAKEUP_VECTOR just in case.

Sorry, I didn't grok that at all.  All occurences of what?


Of the !assigned-device || !VTd-PI || !kvm_vcpu_apicv_active(vcpu) 
checks.  This way, CPUs are woken up correctly even if you have 
!kvm_vcpu_apicv_active(vcpu) but the IRTE is a posted-interrupt one.


Paolo



[PATCH v2 7/7] KVM: Convert kvm_for_each_vcpu() to using xa_for_each_range()

2021-11-16 Thread Marc Zyngier
Now that the vcpu array is backed by an xarray, use the optimised
iterator that matches the underlying data structure.

Suggested-by: Sean Christopherson 
Signed-off-by: Marc Zyngier 
---
 include/linux/kvm_host.h | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index dc635fbfd901..d20e33378c63 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -705,11 +705,9 @@ static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
 	return xa_load(&kvm->vcpu_array, i);
 }
 
-#define kvm_for_each_vcpu(idx, vcpup, kvm) \
-   for (idx = 0; \
-idx < atomic_read(&kvm->online_vcpus) && \
-(vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
-idx++)
+#define kvm_for_each_vcpu(idx, vcpup, kvm)\
+   xa_for_each_range(&kvm->vcpu_array, idx, vcpup, 0, \
+ (atomic_read(&kvm->online_vcpus) - 1))
 
 static inline struct kvm_vcpu *kvm_get_vcpu_by_id(struct kvm *kvm, int id)
 {
-- 
2.30.2



[PATCH v2 6/7] KVM: Use 'unsigned long' as kvm_for_each_vcpu()'s index

2021-11-16 Thread Marc Zyngier
Everywhere we use kvm_for_each_vcpu(), we use an int as the vcpu
index. Unfortunately, we're about to rework the iterator,
which requires this to be upgraded to an unsigned long.

Let's bite the bullet and repaint all of it in one go.

Signed-off-by: Marc Zyngier 
---
 arch/arm64/kvm/arch_timer.c   |  8 
 arch/arm64/kvm/arm.c  |  6 +++---
 arch/arm64/kvm/pmu-emul.c |  2 +-
 arch/arm64/kvm/psci.c |  6 +++---
 arch/arm64/kvm/reset.c|  2 +-
 arch/arm64/kvm/vgic/vgic-init.c   | 10 ++
 arch/arm64/kvm/vgic/vgic-kvm-device.c |  2 +-
 arch/arm64/kvm/vgic/vgic-mmio-v2.c|  3 +--
 arch/arm64/kvm/vgic/vgic-mmio-v3.c|  7 ---
 arch/arm64/kvm/vgic/vgic-v3.c |  4 ++--
 arch/arm64/kvm/vgic/vgic-v4.c |  5 +++--
 arch/arm64/kvm/vgic/vgic.c|  2 +-
 arch/powerpc/kvm/book3s_32_mmu.c  |  2 +-
 arch/powerpc/kvm/book3s_64_mmu.c  |  2 +-
 arch/powerpc/kvm/book3s_hv.c  |  8 
 arch/powerpc/kvm/book3s_pr.c  |  2 +-
 arch/powerpc/kvm/book3s_xics.c|  6 +++---
 arch/powerpc/kvm/book3s_xics.h|  2 +-
 arch/powerpc/kvm/book3s_xive.c| 15 +--
 arch/powerpc/kvm/book3s_xive.h|  4 ++--
 arch/powerpc/kvm/book3s_xive_native.c |  8 
 arch/powerpc/kvm/e500_emulate.c   |  2 +-
 arch/riscv/kvm/vcpu_sbi.c |  2 +-
 arch/riscv/kvm/vmid.c |  2 +-
 arch/s390/kvm/interrupt.c |  2 +-
 arch/s390/kvm/kvm-s390.c  | 21 +++--
 arch/s390/kvm/kvm-s390.h  |  4 ++--
 arch/x86/kvm/hyperv.c |  7 ---
 arch/x86/kvm/i8254.c  |  2 +-
 arch/x86/kvm/i8259.c  |  5 +++--
 arch/x86/kvm/ioapic.c |  4 ++--
 arch/x86/kvm/irq_comm.c   |  7 ---
 arch/x86/kvm/kvm_onhyperv.c   |  3 ++-
 arch/x86/kvm/lapic.c  |  6 +++---
 arch/x86/kvm/svm/avic.c   |  2 +-
 arch/x86/kvm/svm/sev.c|  3 ++-
 arch/x86/kvm/x86.c| 21 +++--
 include/linux/kvm_host.h  |  2 +-
 virt/kvm/kvm_main.c   | 13 +++--
 39 files changed, 114 insertions(+), 100 deletions(-)

diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
index 3df67c127489..d6f4114f1d11 100644
--- a/arch/arm64/kvm/arch_timer.c
+++ b/arch/arm64/kvm/arch_timer.c
@@ -750,7 +750,7 @@ int kvm_timer_vcpu_reset(struct kvm_vcpu *vcpu)
 /* Make the updates of cntvoff for all vtimer contexts atomic */
 static void update_vtimer_cntvoff(struct kvm_vcpu *vcpu, u64 cntvoff)
 {
-   int i;
+   unsigned long i;
struct kvm *kvm = vcpu->kvm;
struct kvm_vcpu *tmp;
 
@@ -1189,8 +1189,8 @@ void kvm_timer_vcpu_terminate(struct kvm_vcpu *vcpu)
 
 static bool timer_irqs_are_valid(struct kvm_vcpu *vcpu)
 {
-   int vtimer_irq, ptimer_irq;
-   int i, ret;
+   int vtimer_irq, ptimer_irq, ret;
+   unsigned long i;
 
vtimer_irq = vcpu_vtimer(vcpu)->irq.irq;
ret = kvm_vgic_set_owner(vcpu, vtimer_irq, vcpu_vtimer(vcpu));
@@ -1297,7 +1297,7 @@ void kvm_timer_init_vhe(void)
 static void set_timer_irqs(struct kvm *kvm, int vtimer_irq, int ptimer_irq)
 {
struct kvm_vcpu *vcpu;
-   int i;
+   unsigned long i;
 
kvm_for_each_vcpu(i, vcpu, kvm) {
vcpu_vtimer(vcpu)->irq.irq = vtimer_irq;
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 29942d8c357c..6177b0c882ea 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -624,7 +624,7 @@ bool kvm_arch_intc_initialized(struct kvm *kvm)
 
 void kvm_arm_halt_guest(struct kvm *kvm)
 {
-   int i;
+   unsigned long i;
struct kvm_vcpu *vcpu;
 
kvm_for_each_vcpu(i, vcpu, kvm)
@@ -634,7 +634,7 @@ void kvm_arm_halt_guest(struct kvm *kvm)
 
 void kvm_arm_resume_guest(struct kvm *kvm)
 {
-   int i;
+   unsigned long i;
struct kvm_vcpu *vcpu;
 
kvm_for_each_vcpu(i, vcpu, kvm) {
@@ -2020,7 +2020,7 @@ static int finalize_hyp_mode(void)
 struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, unsigned long mpidr)
 {
struct kvm_vcpu *vcpu;
-   int i;
+   unsigned long i;
 
mpidr &= MPIDR_HWID_BITMASK;
kvm_for_each_vcpu(i, vcpu, kvm) {
diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
index a5e4bbf5e68f..0404357705a8 100644
--- a/arch/arm64/kvm/pmu-emul.c
+++ b/arch/arm64/kvm/pmu-emul.c
@@ -900,7 +900,7 @@ static int kvm_arm_pmu_v3_init(struct kvm_vcpu *vcpu)
  */
 static bool pmu_irq_is_valid(struct kvm *kvm, int irq)
 {
-   int i;
+   unsigned long i;
struct kvm_vcpu *vcpu;
 
kvm_for_each_vcpu(i, vcpu, kvm) {
diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c
index 74c47d420253..ed675fce8fb7 100644
--- a/arch/arm64/kvm/psci.c
+++ b/arch/arm64/kvm/psci.c
@@ -121,8 +121,8 @@ static unsigned long 

[PATCH v2 5/7] KVM: Convert the kvm->vcpus array to a xarray

2021-11-16 Thread Marc Zyngier
At least on arm64 and x86, the vcpus array is pretty huge (up to
1024 entries on x86) and is mostly empty in the majority of the cases
(running 1k vcpu VMs is not that common).

This means that we end up with a 4kB block of unused memory in the
middle of the kvm structure.

Instead of wasting away this memory, let's use an xarray instead,
which gives us almost the same flexibility as a normal array, but
with a reduced memory usage with smaller VMs.

Signed-off-by: Marc Zyngier 
---
 include/linux/kvm_host.h |  5 +++--
 virt/kvm/kvm_main.c  | 15 +--
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9a8c997ff9bc..df1b2b183d94 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include <linux/xarray.h>
 #include 
 
 #include 
@@ -552,7 +553,7 @@ struct kvm {
struct mutex slots_arch_lock;
struct mm_struct *mm; /* userspace tied to this vm */
struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
-   struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
+   struct xarray vcpu_array;
 
/* Used to wait for completion of MMU notifiers.  */
spinlock_t mn_invalidate_lock;
@@ -701,7 +702,7 @@ static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
 
/* Pairs with smp_wmb() in kvm_vm_ioctl_create_vcpu.  */
smp_rmb();
-   return kvm->vcpus[i];
+   return xa_load(&kvm->vcpu_array, i);
 }
 
 #define kvm_for_each_vcpu(idx, vcpup, kvm) \
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 73811c3829a2..9a7bce944427 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -458,7 +458,7 @@ void kvm_destroy_vcpus(struct kvm *kvm)
 
kvm_for_each_vcpu(i, vcpu, kvm) {
kvm_vcpu_destroy(vcpu);
-   kvm->vcpus[i] = NULL;
+   xa_erase(&kvm->vcpu_array, i);
}
 
atomic_set(>online_vcpus, 0);
@@ -1063,6 +1063,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
mutex_init(>slots_arch_lock);
spin_lock_init(>mn_invalidate_lock);
rcuwait_init(>mn_memslots_update_rcuwait);
+   xa_init(&kvm->vcpu_array);
 
 	INIT_LIST_HEAD(&kvm->devices);
 
@@ -3658,7 +3659,10 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
}
 
 	vcpu->vcpu_idx = atomic_read(&kvm->online_vcpus);
-   BUG_ON(kvm->vcpus[vcpu->vcpu_idx]);
+   r = xa_insert(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, GFP_KERNEL_ACCOUNT);
+   BUG_ON(r == -EBUSY);
+   if (r)
+   goto unlock_vcpu_destroy;
 
/* Fill the stats id string for the vcpu */
snprintf(vcpu->stats_id, sizeof(vcpu->stats_id), "kvm-%d/vcpu-%d",
@@ -3668,15 +3672,14 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
kvm_get_kvm(kvm);
r = create_vcpu_fd(vcpu);
if (r < 0) {
+   xa_erase(&kvm->vcpu_array, vcpu->vcpu_idx);
kvm_put_kvm_no_destroy(kvm);
goto unlock_vcpu_destroy;
}
 
-   kvm->vcpus[vcpu->vcpu_idx] = vcpu;
-
/*
-* Pairs with smp_rmb() in kvm_get_vcpu.  Write kvm->vcpus
-* before kvm->online_vcpu's incremented value.
+* Pairs with smp_rmb() in kvm_get_vcpu.  Store the vcpu
+* pointer before kvm->online_vcpu's incremented value.
 */
smp_wmb();
 	atomic_inc(&kvm->online_vcpus);
-- 
2.30.2



[PATCH v2 4/7] KVM: x86: Use kvm_get_vcpu() instead of open-coded access

2021-11-16 Thread Marc Zyngier
As we are about to change the way vcpus are allocated, mandate
the use of kvm_get_vcpu() instead of open-coding the access.

Signed-off-by: Marc Zyngier 
---
 arch/x86/kvm/vmx/posted_intr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 5f81ef092bd4..82a49720727d 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -272,7 +272,7 @@ int pi_update_irte(struct kvm *kvm, unsigned int host_irq, uint32_t guest_irq,
 
if (!kvm_arch_has_assigned_device(kvm) ||
!irq_remapping_cap(IRQ_POSTING_CAP) ||
-   !kvm_vcpu_apicv_active(kvm->vcpus[0]))
+   !kvm_vcpu_apicv_active(kvm_get_vcpu(kvm, 0)))
return 0;
 
	idx = srcu_read_lock(&kvm->irq_srcu);
-- 
2.30.2



[PATCH v2 3/7] KVM: s390: Use kvm_get_vcpu() instead of open-coded access

2021-11-16 Thread Marc Zyngier
As we are about to change the way vcpus are allocated, mandate
the use of kvm_get_vcpu() instead of open-coding the access.

Reviewed-by: Claudio Imbrenda 
Signed-off-by: Marc Zyngier 
---
 arch/s390/kvm/kvm-s390.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 7af53b8788fa..4a0f62b03964 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -4572,7 +4572,7 @@ int kvm_s390_vcpu_start(struct kvm_vcpu *vcpu)
}
 
for (i = 0; i < online_vcpus; i++) {
-   if (!is_vcpu_stopped(vcpu->kvm->vcpus[i]))
+   if (!is_vcpu_stopped(kvm_get_vcpu(vcpu->kvm, i)))
started_vcpus++;
}
 
@@ -4634,9 +4634,11 @@ int kvm_s390_vcpu_stop(struct kvm_vcpu *vcpu)
__disable_ibs_on_vcpu(vcpu);
 
for (i = 0; i < online_vcpus; i++) {
-   if (!is_vcpu_stopped(vcpu->kvm->vcpus[i])) {
+   struct kvm_vcpu *tmp = kvm_get_vcpu(vcpu->kvm, i);
+
+   if (!is_vcpu_stopped(tmp)) {
started_vcpus++;
-   started_vcpu = vcpu->kvm->vcpus[i];
+   started_vcpu = tmp;
}
}
 
-- 
2.30.2



[PATCH v2 2/7] KVM: mips: Use kvm_get_vcpu() instead of open-coded access

2021-11-16 Thread Marc Zyngier
As we are about to change the way vcpus are allocated, mandate
the use of kvm_get_vcpu() instead of open-coding the access.

Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Marc Zyngier 
---
 arch/mips/kvm/loongson_ipi.c | 4 ++--
 arch/mips/kvm/mips.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/mips/kvm/loongson_ipi.c b/arch/mips/kvm/loongson_ipi.c
index 3681fc8fba38..5d53f32d837c 100644
--- a/arch/mips/kvm/loongson_ipi.c
+++ b/arch/mips/kvm/loongson_ipi.c
@@ -120,7 +120,7 @@ static int loongson_vipi_write(struct loongson_kvm_ipi *ipi,
s->status |= data;
irq.cpu = id;
irq.irq = 6;
-   kvm_vcpu_ioctl_interrupt(kvm->vcpus[id], &irq);
+   kvm_vcpu_ioctl_interrupt(kvm_get_vcpu(kvm, id), &irq);
break;
 
case CORE0_CLEAR_OFF:
@@ -128,7 +128,7 @@ static int loongson_vipi_write(struct loongson_kvm_ipi *ipi,
if (!s->status) {
irq.cpu = id;
irq.irq = -6;
-   kvm_vcpu_ioctl_interrupt(kvm->vcpus[id], &irq);
+   kvm_vcpu_ioctl_interrupt(kvm_get_vcpu(kvm, id), &irq);
}
break;
 
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index ceacca74f808..6228bf396d63 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -479,7 +479,7 @@ int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,
if (irq->cpu == -1)
dvcpu = vcpu;
else
-   dvcpu = vcpu->kvm->vcpus[irq->cpu];
+   dvcpu = kvm_get_vcpu(vcpu->kvm, irq->cpu);
 
if (intr == 2 || intr == 3 || intr == 4 || intr == 6) {
kvm_mips_callbacks->queue_io_int(dvcpu, irq);
-- 
2.30.2



[PATCH v2 1/7] KVM: Move wiping of the kvm->vcpus array to common code

2021-11-16 Thread Marc Zyngier
All architectures have similar loops iterating over the vcpus,
freeing one vcpu at a time, and eventually wiping the reference
off the vcpus array. They are also inconsistently taking
the kvm->lock mutex when wiping the references from the array.

Make this code common, which will simplify further changes.
The locking is dropped altogether, as this should only be called
when there are no further references to the kvm structure.

Reviewed-by: Claudio Imbrenda 
Signed-off-by: Marc Zyngier 
---
 arch/arm64/kvm/arm.c   | 10 +-
 arch/mips/kvm/mips.c   | 21 +
 arch/powerpc/kvm/powerpc.c | 10 +-
 arch/riscv/kvm/vm.c| 10 +-
 arch/s390/kvm/kvm-s390.c   | 18 +-
 arch/x86/kvm/x86.c |  9 +
 include/linux/kvm_host.h   |  2 +-
 virt/kvm/kvm_main.c| 17 +++--
 8 files changed, 22 insertions(+), 75 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 2f03cbfefe67..29942d8c357c 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -175,19 +175,11 @@ vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
  */
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
-   int i;
-
bitmap_free(kvm->arch.pmu_filter);
 
kvm_vgic_destroy(kvm);
 
-   for (i = 0; i < KVM_MAX_VCPUS; ++i) {
-   if (kvm->vcpus[i]) {
-   kvm_vcpu_destroy(kvm->vcpus[i]);
-   kvm->vcpus[i] = NULL;
-   }
-   }
-   atomic_set(&kvm->online_vcpus, 0);
+   kvm_destroy_vcpus(kvm);
 }
 
 int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index 562aa878b266..ceacca74f808 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -171,25 +171,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
return 0;
 }
 
-void kvm_mips_free_vcpus(struct kvm *kvm)
-{
-   unsigned int i;
-   struct kvm_vcpu *vcpu;
-
-   kvm_for_each_vcpu(i, vcpu, kvm) {
-   kvm_vcpu_destroy(vcpu);
-   }
-
-   mutex_lock(&kvm->lock);
-
-   for (i = 0; i < atomic_read(&kvm->online_vcpus); i++)
-   kvm->vcpus[i] = NULL;
-
-   atomic_set(&kvm->online_vcpus, 0);
-
-   mutex_unlock(&kvm->lock);
-}
-
 static void kvm_mips_free_gpa_pt(struct kvm *kvm)
 {
/* It should always be safe to remove after flushing the whole range */
@@ -199,7 +180,7 @@ static void kvm_mips_free_gpa_pt(struct kvm *kvm)
 
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
-   kvm_mips_free_vcpus(kvm);
+   kvm_destroy_vcpus(kvm);
kvm_mips_free_gpa_pt(kvm);
 }
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 35e9cccdeef9..492e4a4121cb 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -463,9 +463,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
-   unsigned int i;
-   struct kvm_vcpu *vcpu;
-
 #ifdef CONFIG_KVM_XICS
/*
 * We call kick_all_cpus_sync() to ensure that all
@@ -476,14 +473,9 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
kick_all_cpus_sync();
 #endif
 
-   kvm_for_each_vcpu(i, vcpu, kvm)
-   kvm_vcpu_destroy(vcpu);
+   kvm_destroy_vcpus(kvm);
 
	mutex_lock(&kvm->lock);
-   for (i = 0; i < atomic_read(&kvm->online_vcpus); i++)
-   kvm->vcpus[i] = NULL;
-
-   atomic_set(&kvm->online_vcpus, 0);
 
kvmppc_core_destroy_vm(kvm);
 
diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c
index 26399df15b63..6af6cde295eb 100644
--- a/arch/riscv/kvm/vm.c
+++ b/arch/riscv/kvm/vm.c
@@ -46,15 +46,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
-   int i;
-
-   for (i = 0; i < KVM_MAX_VCPUS; ++i) {
-   if (kvm->vcpus[i]) {
-   kvm_vcpu_destroy(kvm->vcpus[i]);
-   kvm->vcpus[i] = NULL;
-   }
-   }
-   atomic_set(&kvm->online_vcpus, 0);
+   kvm_destroy_vcpus(kvm);
 }
 
 int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index c6257f625929..7af53b8788fa 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2819,27 +2819,11 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
free_page((unsigned long)(vcpu->arch.sie_block));
 }
 
-static void kvm_free_vcpus(struct kvm *kvm)
-{
-   unsigned int i;
-   struct kvm_vcpu *vcpu;
-
-   kvm_for_each_vcpu(i, vcpu, kvm)
-   kvm_vcpu_destroy(vcpu);
-
-   mutex_lock(&kvm->lock);
-   for (i = 0; i < atomic_read(&kvm->online_vcpus); i++)
-   kvm->vcpus[i] = NULL;
-
-   atomic_set(&kvm->online_vcpus, 0);
-   mutex_unlock(&kvm->lock);
-}
-
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
u16 rc, rrc;
 
-   kvm_free_vcpus(kvm);
+   kvm_destroy_vcpus(kvm);

[PATCH v2 0/7] KVM: Turn the vcpu array into an xarray

2021-11-16 Thread Marc Zyngier
The kvm structure is pretty large. A large portion of it is the vcpu
array, which is 4kB on arm64 with 512 vcpu, double that on x86-64.  Of
course, hardly anyone runs VMs this big, so this is often a net waste
of memory and cache locality.

A possible approach is to turn the fixed-size array into an xarray,
which results in a net code deletion after a bit of cleanup.

* From v1:
  - Rebased on v5.16-rc1
  - Dropped the dubious locking on teardown
  - Converted kvm_for_each_vcpu() to xa_for_each_range(), together with
an invasive change converting the index to an unsigned long

Marc Zyngier (7):
  KVM: Move wiping of the kvm->vcpus array to common code
  KVM: mips: Use kvm_get_vcpu() instead of open-coded access
  KVM: s390: Use kvm_get_vcpu() instead of open-coded access
  KVM: x86: Use kvm_get_vcpu() instead of open-coded access
  KVM: Convert the kvm->vcpus array to a xarray
  KVM: Use 'unsigned long' as kvm_for_each_vcpu()'s index
  KVM: Convert kvm_for_each_vcpu() to using xa_for_each_range()

 arch/arm64/kvm/arch_timer.c   |  8 ++---
 arch/arm64/kvm/arm.c  | 16 +++--
 arch/arm64/kvm/pmu-emul.c |  2 +-
 arch/arm64/kvm/psci.c |  6 ++--
 arch/arm64/kvm/reset.c|  2 +-
 arch/arm64/kvm/vgic/vgic-init.c   | 10 +++---
 arch/arm64/kvm/vgic/vgic-kvm-device.c |  2 +-
 arch/arm64/kvm/vgic/vgic-mmio-v2.c|  3 +-
 arch/arm64/kvm/vgic/vgic-mmio-v3.c|  7 ++--
 arch/arm64/kvm/vgic/vgic-v3.c |  4 +--
 arch/arm64/kvm/vgic/vgic-v4.c |  5 +--
 arch/arm64/kvm/vgic/vgic.c|  2 +-
 arch/mips/kvm/loongson_ipi.c  |  4 +--
 arch/mips/kvm/mips.c  | 23 ++---
 arch/powerpc/kvm/book3s_32_mmu.c  |  2 +-
 arch/powerpc/kvm/book3s_64_mmu.c  |  2 +-
 arch/powerpc/kvm/book3s_hv.c  |  8 ++---
 arch/powerpc/kvm/book3s_pr.c  |  2 +-
 arch/powerpc/kvm/book3s_xics.c|  6 ++--
 arch/powerpc/kvm/book3s_xics.h|  2 +-
 arch/powerpc/kvm/book3s_xive.c| 15 +
 arch/powerpc/kvm/book3s_xive.h|  4 +--
 arch/powerpc/kvm/book3s_xive_native.c |  8 ++---
 arch/powerpc/kvm/e500_emulate.c   |  2 +-
 arch/powerpc/kvm/powerpc.c| 10 +-
 arch/riscv/kvm/vcpu_sbi.c |  2 +-
 arch/riscv/kvm/vm.c   | 10 +-
 arch/riscv/kvm/vmid.c |  2 +-
 arch/s390/kvm/interrupt.c |  2 +-
 arch/s390/kvm/kvm-s390.c  | 47 ++-
 arch/s390/kvm/kvm-s390.h  |  4 +--
 arch/x86/kvm/hyperv.c |  7 ++--
 arch/x86/kvm/i8254.c  |  2 +-
 arch/x86/kvm/i8259.c  |  5 +--
 arch/x86/kvm/ioapic.c |  4 +--
 arch/x86/kvm/irq_comm.c   |  7 ++--
 arch/x86/kvm/kvm_onhyperv.c   |  3 +-
 arch/x86/kvm/lapic.c  |  6 ++--
 arch/x86/kvm/svm/avic.c   |  2 +-
 arch/x86/kvm/svm/sev.c|  3 +-
 arch/x86/kvm/vmx/posted_intr.c|  2 +-
 arch/x86/kvm/x86.c| 30 +++--
 include/linux/kvm_host.h  | 17 +-
 virt/kvm/kvm_main.c   | 41 ---
 44 files changed, 158 insertions(+), 193 deletions(-)

-- 
2.30.2
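
Patch 7's conversion of the iterator itself is not shown in this digest; given the vcpu_array introduced in patch 5, a plausible shape for it is (hypothetical sketch, field and macro names assumed from the patches above):

```c
/* Sketch: iterate the xarray instead of indexing a fixed array.
 * The index must be an unsigned long for xa_for_each_range(),
 * which is what patch 6's kvm_for_each_vcpu() index change enables. */
#define kvm_for_each_vcpu(idx, vcpup, kvm)			   \
	xa_for_each_range(&kvm->vcpu_array, idx, vcpup, 0,	   \
			  (atomic_read(&kvm->online_vcpus) - 1))
```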



Re: [PATCH 0/5] KVM: Turn the vcpu array into an xarray

2021-11-16 Thread Marc Zyngier
On Tue, 16 Nov 2021 15:03:40 +,
Paolo Bonzini  wrote:
> 
> On 11/5/21 20:20, Marc Zyngier wrote:
> > The kvm structure is pretty large. A large portion of it is the vcpu
> > array, which is 4kB on x86_64 and arm64 as they deal with 512 vcpu
> > VMs. Of course, hardly anyone runs VMs this big, so this is often a
> > net waste of memory and cache locality.
> > 
> > A possible approach is to turn the fixed-size array into an xarray,
> > which results in a net code deletion after a bit of cleanup.
> > 
> > This series is on top of the current linux/master as it touches the
> > RISC-V implementation. Only tested on arm64.
> 
> Queued, only locally until I get a review for my replacement of patch
> 4 (see
> https://lore.kernel.org/kvm/2026142205.719375-1-pbonz...@redhat.com/T/).

In which case, let me send a v2 with the changes that we discussed
with Sean. It will still have my version of patch 4, but that's
nothing you can't fix.

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH] powerpc: clean vdso32 and vdso64 directories

2021-11-16 Thread Christophe Leroy

Hi Masahiro,

On 09/11/2021 at 19:50, Masahiro Yamada wrote:

Since commit bce74491c300 ("powerpc/vdso: fix unnecessary rebuilds of
vgettimeofday.o"), "make ARCH=powerpc clean" does not clean up the
arch/powerpc/kernel/{vdso32,vdso64} directories.

Use the subdir- trick to let "make clean" descend into them.

Fixes: bce74491c300 ("powerpc/vdso: fix unnecessary rebuilds of vgettimeofday.o")
Signed-off-by: Masahiro Yamada 
---

  arch/powerpc/kernel/Makefile | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 0e3640e14eb1..5fa68c2ef1f8 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -196,3 +196,6 @@ clean-files := vmlinux.lds
  # Force dependency (incbin is bad)
  $(obj)/vdso32_wrapper.o : $(obj)/vdso32/vdso32.so.dbg
  $(obj)/vdso64_wrapper.o : $(obj)/vdso64/vdso64.so.dbg
+
+# for cleaning
+subdir- += vdso32 vdso64



This patch makes me think about one thing I would have liked to do, but I 
don't know Makefiles well enough to be able to do it. You could probably 
help me with it.


vdso32 and vdso64 contain a lot of redundant sources. I would like to 
merge them into a new single directory, let say 'vdso', and use the 
files in that directory to build both vdso32.so and vdso64.so. I have a 
feeling that x86 is doing it that way, but I've not been able to figure 
out how to build two objects using the same C/S files.


Thanks
Christophe
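
One possible shape for such a shared-source build, loosely modelled on how x86 handles its 32-bit vdso, could give each flavour its own object suffix and compile rule. All names below are hypothetical and untested, purely a sketch of the idea:

```make
# Hypothetical sketch: a single 'vdso' directory, two flavours of objects.
src-vdso := gettimeofday.c sigtramp.S          # shared sources (assumed names)

obj-vdso32 := $(addsuffix -32.o, $(basename $(src-vdso)))
obj-vdso64 := $(addsuffix -64.o, $(basename $(src-vdso)))

# Separate pattern rules let each flavour carry its own flags (-m32 vs -m64),
# assuming cmd_vdso32cc/cmd_vdso64cc are defined accordingly.
$(obj)/%-32.o: $(src)/%.c FORCE
	$(call if_changed_dep,vdso32cc)

$(obj)/%-64.o: $(src)/%.c FORCE
	$(call if_changed_dep,vdso64cc)
```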


Re: [PATCH v2 12/13] lkdtm: Fix execute_[user]_location()

2021-11-16 Thread Christophe Leroy

Hi Kees,

On 16/10/2021 at 08:42, Christophe Leroy wrote:



On 15/10/2021 at 23:31, Kees Cook wrote:

On Thu, Oct 14, 2021 at 07:50:01AM +0200, Christophe Leroy wrote:

execute_location() and execute_user_location() intend
to copy the do_nothing() text and execute it at a new location.
However, at the moment they don't copy the do_nothing() function
but the do_nothing() function descriptor, which still points to the
original text. So in the end they still execute do_nothing() at
its original location, although through a copied function descriptor.

So, fix that by really copying the do_nothing() text and building a new
function descriptor: copy the do_nothing() function descriptor and
update the target address with the new location.

Also fix the displayed addresses by dereferencing the do_nothing()
function descriptor.

Signed-off-by: Christophe Leroy 
---
  drivers/misc/lkdtm/perms.c | 25 +
  include/asm-generic/sections.h |  5 +
  2 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
index 5266dc28df6e..96b3ebfcb8ed 100644
--- a/drivers/misc/lkdtm/perms.c
+++ b/drivers/misc/lkdtm/perms.c
@@ -44,19 +44,32 @@ static noinline void do_overwritten(void)
  return;
  }
+static void *setup_function_descriptor(func_desc_t *fdesc, void *dst)
+{
+    memcpy(fdesc, do_nothing, sizeof(*fdesc));
+    fdesc->addr = (unsigned long)dst;
+    barrier();
+
+    return fdesc;
+}


How about collapsing the "have_function_descriptors()" check into
setup_function_descriptor()?

static void *setup_function_descriptor(func_desc_t *fdesc, void *dst)
{
if (__is_defined(HAVE_FUNCTION_DESCRIPTORS)) {
    memcpy(fdesc, do_nothing, sizeof(*fdesc));
    fdesc->addr = (unsigned long)dst;
    barrier();
    return fdesc;
} else {
    return dst;
}
}


Ok



...



diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
index 76163883c6ff..d225318538bd 100644
--- a/include/asm-generic/sections.h
+++ b/include/asm-generic/sections.h
@@ -70,6 +70,11 @@ typedef struct {
  } func_desc_t;
  #endif
+static inline bool have_function_descriptors(void)
+{
+    return __is_defined(HAVE_FUNCTION_DESCRIPTORS);
+}
+
  /* random extra sections (if any).  Override
   * in asm/sections.h */
  #ifndef arch_is_kernel_text


This hunk seems like it should live in a separate patch.



Ok I move it in a previous patch.



Do you have any additional feedback or comments on series v3?

What's the way forward: should it go via the LKDTM tree, the powerpc tree, 
or another tree? I see there are neither Acked-by nor Reviewed-by tags for 
the last 2 patches.


Thanks
Christophe


Re: [PATCH 0/5] KVM: Turn the vcpu array into an xarray

2021-11-16 Thread Paolo Bonzini

On 11/5/21 20:20, Marc Zyngier wrote:

The kvm structure is pretty large. A large portion of it is the vcpu
array, which is 4kB on x86_64 and arm64 as they deal with 512 vcpu
VMs. Of course, hardly anyone runs VMs this big, so this is often a
net waste of memory and cache locality.

A possible approach is to turn the fixed-size array into an xarray,
which results in a net code deletion after a bit of cleanup.

This series is on top of the current linux/master as it touches the
RISC-V implementation. Only tested on arm64.


Queued, only locally until I get a review for my replacement of patch 4 
(see 
https://lore.kernel.org/kvm/2026142205.719375-1-pbonz...@redhat.com/T/).


Paolo


Marc Zyngier (5):
   KVM: Move wiping of the kvm->vcpus array to common code
   KVM: mips: Use kvm_get_vcpu() instead of open-coded access
   KVM: s390: Use kvm_get_vcpu() instead of open-coded access
   KVM: x86: Use kvm_get_vcpu() instead of open-coded access
   KVM: Convert the kvm->vcpus array to a xarray

  arch/arm64/kvm/arm.c   | 10 +-
  arch/mips/kvm/loongson_ipi.c   |  4 ++--
  arch/mips/kvm/mips.c   | 23 ++-
  arch/powerpc/kvm/powerpc.c | 10 +-
  arch/riscv/kvm/vm.c| 10 +-
  arch/s390/kvm/kvm-s390.c   | 26 ++
  arch/x86/kvm/vmx/posted_intr.c |  2 +-
  arch/x86/kvm/x86.c |  9 +
  include/linux/kvm_host.h   |  7 ---
  virt/kvm/kvm_main.c| 33 ++---
  10 files changed, 45 insertions(+), 89 deletions(-)





Re: [RESEND PATCH v5 0/4] Add perf interface to expose nvdimm

2021-11-16 Thread LEROY Christophe
Hi

On 16/11/2021 at 05:49, Kajol Jain wrote:
> Patchset adds performance stats reporting support for nvdimm.
> Added interface includes support for pmu register/unregister
> functions. A structure is added called nvdimm_pmu to be used for
> adding arch/platform specific data such as cpumask, nvdimm device
> pointer and pmu event functions like event_init/add/read/del.
> User could use the standard perf tool to access perf events
> exposed via pmu.
> 
> Interface also defines supported event list, config fields for the
> event attributes and their corresponding bit values which are exported
> via sysfs. Patch 3 exposes IBM pseries platform nmem* device
> performance stats using this interface.

Are you resending your v5 series? Is there any news since your submission 
last month that's awaiting in patchwork here at 
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=264422 ?

Christophe


> 
> Result from power9 pseries lpar with 2 nvdimm device:
> 
> Ex: List all event by perf list
> 
> command:# perf list nmem
> 
>nmem0/cache_rh_cnt/[Kernel PMU event]
>nmem0/cache_wh_cnt/[Kernel PMU event]
>nmem0/cri_res_util/[Kernel PMU event]
>nmem0/ctl_res_cnt/ [Kernel PMU event]
>nmem0/ctl_res_tm/  [Kernel PMU event]
>nmem0/fast_w_cnt/  [Kernel PMU event]
>nmem0/host_l_cnt/  [Kernel PMU event]
>nmem0/host_l_dur/  [Kernel PMU event]
>nmem0/host_s_cnt/  [Kernel PMU event]
>nmem0/host_s_dur/  [Kernel PMU event]
>nmem0/med_r_cnt/   [Kernel PMU event]
>nmem0/med_r_dur/   [Kernel PMU event]
>nmem0/med_w_cnt/   [Kernel PMU event]
>nmem0/med_w_dur/   [Kernel PMU event]
>nmem0/mem_life/[Kernel PMU event]
>nmem0/poweron_secs/[Kernel PMU event]
>...
>nmem1/mem_life/[Kernel PMU event]
>nmem1/poweron_secs/[Kernel PMU event]
> 
> Patch1:
>  Introduces the nvdimm_pmu structure
> Patch2:
>  Adds common interface to add arch/platform specific data
>  includes nvdimm device pointer, pmu data along with
>  pmu event functions. It also defines supported event list
>  and adds attribute groups for format, events and cpumask.
>  It also adds code for cpu hotplug support.
> Patch3:
>  Add code in arch/powerpc/platform/pseries/papr_scm.c to expose
>  nmem* pmu. It fills in the nvdimm_pmu structure with pmu name,
>  capabilities, cpumask and event functions and then registers
>  the pmu by adding callbacks to register_nvdimm_pmu.
> Patch4:
>  Sysfs documentation patch
> 
> Changelog
> ---
> v4 -> v5:
> - Remove multiple variables defined in the nvdimm_pmu structure, including
>    the name and pmu functions (event_init/add/del/read), as they were just
>    copied again into the pmu variable. Now we are directly doing
>    this step in arch-specific code as suggested by Dan Williams.
> 
> - Remove attribute group field from nvdimm pmu structure and
>defined these attribute groups in common interface which
>includes format, event list along with cpumask as suggested by
>Dan Williams.
>    Since we added static definitions for the attribute groups needed in
>    the common interface, remove the corresponding code from papr.
> 
> - Add nvdimm pmu event list with event codes in the common interface.
> 
> - Remove Acked-by/Reviewed-by/Tested-by tags as code is refactored
>to handle review comments from Dan.
> 
> - Make nvdimm_pmu_free_hotplug_memory function static as reported
>by kernel test robot, also add corresponding Reported-by tag.
> 
> - Link to the patchset v4: https://lkml.org/lkml/2021/9/3/45
> 
> v3 -> v4
> - Rebase code on top of current papr_scm code without any logical
>changes.
> 
> - Added Acked-by tag from Peter Zijlstra and Reviewed by tag
>from Madhavan Srinivasan.
> 
> - Link to the patchset v3: https://lkml.org/lkml/2021/6/17/605
> 
> v2 -> v3
> - Added Tested-by tag.
> 
> - Fix nvdimm mailing list in the ABI Documentation.
> 
> - Link to the patchset v2: https://lkml.org/lkml/2021/6/14/25
> 
> v1 -> v2
> - Fix hotplug code by adding a pmu migration call
>    in case the current designated cpu goes offline, as
>    pointed out by Peter Zijlstra.
> 
> - Removed the return -1 part from the cpu hotplug offline
>    function.
> 
> - Link to the patchset v1: https://lkml.org/lkml/2021/6/8/500
> 
> Kajol Jain (4):
>drivers/nvdimm: Add nvdimm pmu structure
>drivers/nvdimm: Add perf interface 

Re: [PATCH] powerpc/xive: Change IRQ domain to a tree domain

2021-11-16 Thread Cédric Le Goater

On 11/16/21 15:23, Greg Kurz wrote:

On Tue, 16 Nov 2021 14:40:22 +0100
Cédric Le Goater  wrote:


Commit 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains
exclusive") introduced an IRQ_DOMAIN_FLAG_NO_MAP flag to isolate the
'nomap' domains still in use under the powerpc arch. With this new
flag, the revmap_tree of the IRQ domain is not used anymore. This
change broke the support of shared LSIs [1] in the XIVE driver because
it was relying on a lookup in the revmap_tree to query previously
mapped interrupts. Linux now creates two distinct IRQ mappings on the
same HW IRQ which can lead to unexpected behavior in the drivers.

The XIVE IRQ domain is not a direct mapping domain and its HW IRQ
interrupt number space is rather large : 1M/socket on POWER9 and
POWER10, change the XIVE driver to use a 'tree' domain type instead.

[1] For instance, a linux KVM guest with virtio-rng and virtio-balloon
 devices.

Cc: Marc Zyngier 
Cc: sta...@vger.kernel.org # v5.14+
Fixes: 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains exclusive")
Signed-off-by: Cédric Le Goater 
---



Tested-by: Greg Kurz 

with a KVM guest + virtio-rng + virtio-balloon on a POWER9 host.


Did you test on a 5.14 backport or mainline ?

I am asking because a large change adding support for MSI domains
to XIVE was merged in 5.15.

Thanks,

C.





  Marc,

  The Fixes tag is there because the patch in question revealed that
  something was broken in XIVE. genirq is not in cause. However, I
  don't know for PS3 and Cell. May be less critical for now.
  
  arch/powerpc/sysdev/xive/common.c | 3 +--

  arch/powerpc/sysdev/xive/Kconfig  | 1 -
  2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index fed6fd16c8f4..9d0f0fe25598 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1536,8 +1536,7 @@ static const struct irq_domain_ops xive_irq_domain_ops = {
  
  static void __init xive_init_host(struct device_node *np)

  {
-   xive_irq_domain = irq_domain_add_nomap(np, XIVE_MAX_IRQ,
-  &xive_irq_domain_ops, NULL);
+   xive_irq_domain = irq_domain_add_tree(np, &xive_irq_domain_ops, NULL);
if (WARN_ON(xive_irq_domain == NULL))
return;
irq_set_default_host(xive_irq_domain);
diff --git a/arch/powerpc/sysdev/xive/Kconfig b/arch/powerpc/sysdev/xive/Kconfig
index 97796c6b63f0..785c292d104b 100644
--- a/arch/powerpc/sysdev/xive/Kconfig
+++ b/arch/powerpc/sysdev/xive/Kconfig
@@ -3,7 +3,6 @@ config PPC_XIVE
bool
select PPC_SMP_MUXED_IPI
select HARDIRQS_SW_RESEND
-   select IRQ_DOMAIN_NOMAP
  
  config PPC_XIVE_NATIVE

bool
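
The shared-LSI behaviour described above relies on a lookup finding a previously created mapping, which a tree domain supports via its radix tree. The pattern can be sketched as follows; this is an illustrative helper, not the actual XIVE code:

```c
/* Illustrative only: with a tree domain, irq_find_mapping() consults
 * the domain's radix tree, so a second user of the same HW IRQ (a
 * shared LSI) resolves to the existing virq instead of a duplicate. */
static unsigned int map_shared_lsi(struct irq_domain *d, irq_hw_number_t hwirq)
{
	unsigned int virq = irq_find_mapping(d, hwirq);

	if (!virq)	/* first user: create the mapping */
		virq = irq_create_mapping(d, hwirq);
	return virq;
}
```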






Re: [PATCH] powerpc/xive: Change IRQ domain to a tree domain

2021-11-16 Thread Greg Kurz
On Tue, 16 Nov 2021 15:49:13 +0100
Cédric Le Goater  wrote:

> On 11/16/21 15:23, Greg Kurz wrote:
> > On Tue, 16 Nov 2021 14:40:22 +0100
> > Cédric Le Goater  wrote:
> > 
> >> Commit 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains
> >> exclusive") introduced an IRQ_DOMAIN_FLAG_NO_MAP flag to isolate the
> >> 'nomap' domains still in use under the powerpc arch. With this new
> >> flag, the revmap_tree of the IRQ domain is not used anymore. This
> >> change broke the support of shared LSIs [1] in the XIVE driver because
> >> it was relying on a lookup in the revmap_tree to query previously
> >> mapped interrupts. Linux now creates two distinct IRQ mappings on the
> >> same HW IRQ which can lead to unexpected behavior in the drivers.
> >>
> >> The XIVE IRQ domain is not a direct mapping domain and its HW IRQ
> >> interrupt number space is rather large : 1M/socket on POWER9 and
> >> POWER10, change the XIVE driver to use a 'tree' domain type instead.
> >>
> >> [1] For instance, a linux KVM guest with virtio-rng and virtio-balloon
> >>  devices.
> >>
> >> Cc: Marc Zyngier 
> >> Cc: sta...@vger.kernel.org # v5.14+
> >> Fixes: 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains 
> >> exclusive")
> >> Signed-off-by: Cédric Le Goater 
> >> ---
> >>
> > 
> > Tested-by: Greg Kurz 
> > 
> > with a KVM guest + virtio-rng + virtio-balloon on a POWER9 host.
> 
> Did you test on a 5.14 backport or mainline ?
> 

I've tested on a 5.14 backport only.

> I am asking because a large change adding support for MSI domains
> to XIVE was merged in 5.15.
> 
> Thanks,
> 
> C.
> 
> 
> > 
> >>   Marc,
> >>
> >>   The Fixes tag is there because the patch in question revealed that
> >>   something was broken in XIVE. genirq is not in cause. However, I
> >>   don't know for PS3 and Cell. May be less critical for now.
> >>   
> >>   arch/powerpc/sysdev/xive/common.c | 3 +--
> >>   arch/powerpc/sysdev/xive/Kconfig  | 1 -
> >>   2 files changed, 1 insertion(+), 3 deletions(-)
> >>
> >> diff --git a/arch/powerpc/sysdev/xive/common.c 
> >> b/arch/powerpc/sysdev/xive/common.c
> >> index fed6fd16c8f4..9d0f0fe25598 100644
> >> --- a/arch/powerpc/sysdev/xive/common.c
> >> +++ b/arch/powerpc/sysdev/xive/common.c
> >> @@ -1536,8 +1536,7 @@ static const struct irq_domain_ops 
> >> xive_irq_domain_ops = {
> >>   
> >>   static void __init xive_init_host(struct device_node *np)
> >>   {
> >> -  xive_irq_domain = irq_domain_add_nomap(np, XIVE_MAX_IRQ,
> >> - &xive_irq_domain_ops, NULL);
> >> +  xive_irq_domain = irq_domain_add_tree(np, &xive_irq_domain_ops, NULL);
> >>if (WARN_ON(xive_irq_domain == NULL))
> >>return;
> >>irq_set_default_host(xive_irq_domain);
> >> diff --git a/arch/powerpc/sysdev/xive/Kconfig 
> >> b/arch/powerpc/sysdev/xive/Kconfig
> >> index 97796c6b63f0..785c292d104b 100644
> >> --- a/arch/powerpc/sysdev/xive/Kconfig
> >> +++ b/arch/powerpc/sysdev/xive/Kconfig
> >> @@ -3,7 +3,6 @@ config PPC_XIVE
> >>bool
> >>select PPC_SMP_MUXED_IPI
> >>select HARDIRQS_SW_RESEND
> >> -  select IRQ_DOMAIN_NOMAP
> >>   
> >>   config PPC_XIVE_NATIVE
> >>bool
> > 
> 



Re: [PATCH] powerpc/xive: Change IRQ domain to a tree domain

2021-11-16 Thread Greg Kurz
On Tue, 16 Nov 2021 14:40:22 +0100
Cédric Le Goater  wrote:

> Commit 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains
> exclusive") introduced an IRQ_DOMAIN_FLAG_NO_MAP flag to isolate the
> 'nomap' domains still in use under the powerpc arch. With this new
> flag, the revmap_tree of the IRQ domain is not used anymore. This
> change broke the support of shared LSIs [1] in the XIVE driver because
> it was relying on a lookup in the revmap_tree to query previously
> mapped interrupts. Linux now creates two distinct IRQ mappings on the
> same HW IRQ which can lead to unexpected behavior in the drivers.
> 
> The XIVE IRQ domain is not a direct mapping domain and its HW IRQ
> interrupt number space is rather large : 1M/socket on POWER9 and
> POWER10, change the XIVE driver to use a 'tree' domain type instead.
> 
> [1] For instance, a linux KVM guest with virtio-rng and virtio-balloon
> devices.
> 
> Cc: Marc Zyngier 
> Cc: sta...@vger.kernel.org # v5.14+
> Fixes: 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains exclusive")
> Signed-off-by: Cédric Le Goater 
> ---
> 

Tested-by: Greg Kurz 

with a KVM guest + virtio-rng + virtio-balloon on a POWER9 host.

>  Marc,
> 
>  The Fixes tag is there because the patch in question revealed that
>  something was broken in XIVE. genirq is not in cause. However, I
>  don't know for PS3 and Cell. May be less critical for now. 
>  
>  arch/powerpc/sysdev/xive/common.c | 3 +--
>  arch/powerpc/sysdev/xive/Kconfig  | 1 -
>  2 files changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index fed6fd16c8f4..9d0f0fe25598 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -1536,8 +1536,7 @@ static const struct irq_domain_ops xive_irq_domain_ops 
> = {
>  
>  static void __init xive_init_host(struct device_node *np)
>  {
> - xive_irq_domain = irq_domain_add_nomap(np, XIVE_MAX_IRQ,
> -&xive_irq_domain_ops, NULL);
> + xive_irq_domain = irq_domain_add_tree(np, &xive_irq_domain_ops, NULL);
>   if (WARN_ON(xive_irq_domain == NULL))
>   return;
>   irq_set_default_host(xive_irq_domain);
> diff --git a/arch/powerpc/sysdev/xive/Kconfig 
> b/arch/powerpc/sysdev/xive/Kconfig
> index 97796c6b63f0..785c292d104b 100644
> --- a/arch/powerpc/sysdev/xive/Kconfig
> +++ b/arch/powerpc/sysdev/xive/Kconfig
> @@ -3,7 +3,6 @@ config PPC_XIVE
>   bool
>   select PPC_SMP_MUXED_IPI
>   select HARDIRQS_SW_RESEND
> - select IRQ_DOMAIN_NOMAP
>  
>  config PPC_XIVE_NATIVE
>   bool



Re: [PATCH 0/5] KVM: Turn the vcpu array into an xarray

2021-11-16 Thread Paolo Bonzini

On 11/16/21 15:13, Juergen Gross wrote:

On 05.11.21 20:20, Marc Zyngier wrote:

The kvm structure is pretty large. A large portion of it is the vcpu
array, which is 4kB on x86_64 and arm64 as they deal with 512 vcpu
VMs. Of course, hardly anyone runs VMs this big, so this is often a
net waste of memory and cache locality.

A possible approach is to turn the fixed-size array into an xarray,
which results in a net code deletion after a bit of cleanup.

This series is on top of the current linux/master as it touches the
RISC-V implementation. Only tested on arm64.

Marc Zyngier (5):
   KVM: Move wiping of the kvm->vcpus array to common code
   KVM: mips: Use kvm_get_vcpu() instead of open-coded access
   KVM: s390: Use kvm_get_vcpu() instead of open-coded access
   KVM: x86: Use kvm_get_vcpu() instead of open-coded access
   KVM: Convert the kvm->vcpus array to a xarray

  arch/arm64/kvm/arm.c   | 10 +-
  arch/mips/kvm/loongson_ipi.c   |  4 ++--
  arch/mips/kvm/mips.c   | 23 ++-
  arch/powerpc/kvm/powerpc.c | 10 +-
  arch/riscv/kvm/vm.c    | 10 +-
  arch/s390/kvm/kvm-s390.c   | 26 ++
  arch/x86/kvm/vmx/posted_intr.c |  2 +-
  arch/x86/kvm/x86.c |  9 +
  include/linux/kvm_host.h   |  7 ---
  virt/kvm/kvm_main.c    | 33 ++---
  10 files changed, 45 insertions(+), 89 deletions(-)



For x86 you can add my:

Tested-by: Juergen Gross 


Heh, unfortunately x86 is the only one that needs a change in patch 4. 
I'll Cc you on my version.


Paolo



Re: [PATCH 4/5] KVM: x86: Use kvm_get_vcpu() instead of open-coded access

2021-11-16 Thread Paolo Bonzini

On 11/5/21 21:03, Sean Christopherson wrote:

But I think even that is flawed, as APICv can be dynamically deactivated and
re-activated while the VM is running, and I don't see a path that re-updates
the IRTE when APICv is re-activated.  So I think a more conservative check is
needed, e.g.

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 5f81ef092bd4..6cf5b2e86118 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -272,7 +272,7 @@ int pi_update_irte(struct kvm *kvm, unsigned int host_irq, uint32_t guest_irq,

 if (!kvm_arch_has_assigned_device(kvm) ||
 !irq_remapping_cap(IRQ_POSTING_CAP) ||
-   !kvm_vcpu_apicv_active(kvm->vcpus[0]))
+   !irqchip_in_kernel(kvm) || !enable_apicv)
 return 0;

 	idx = srcu_read_lock(&kvm->irq_srcu);


What happens then if pi_pre_block is called and the IRTE denotes a 
posted interrupt?


I might be wrong, but it seems to me that you have to change all of the 
occurrences this way.  As soon as enable_apicv is set, you need to go 
through the POSTED_INTR_WAKEUP_VECTOR just in case.


Paolo



Re: [PATCH 1/5] KVM: Move wiping of the kvm->vcpus array to common code

2021-11-16 Thread Paolo Bonzini

On 11/6/21 12:17, Marc Zyngier wrote:


If you too believe that this is just wrong, I'm happy to drop the
locking altogether. If that breaks someone's flow, they'll shout soon
enough.


Yes, it's not necessary.  It was added in 2009 (commit 988a2cae6a3c, 
"KVM: Use macro to iterate over vcpus.") and it was unnecessary back 
then too.


Paolo



Re: [PATCH v2 3/5] powerpc: Use preemption model accessors

2021-11-16 Thread Christophe Leroy




Le 10/11/2021 à 21:24, Valentin Schneider a écrit :

Per PREEMPT_DYNAMIC, checking CONFIG_PREEMPT doesn't tell you the actual
preemption model of the live kernel. Use the newly-introduced accessors
instead.


Is that change worth it for now? As far as I can see powerpc doesn't 
have PREEMPT_DYNAMIC, and a lot of work needs to be done before being able 
to use it:

- Implement GENERIC_ENTRY
- Implement STATIC_CALLS (already done on PPC32, to be done on PPC64)



sched_init() -> preempt_dynamic_init() happens way before IRQs are set up,
so this should be fine.


It looks like you are mixing up interrupts and IRQs (also known as 
"external interrupts").


ISI (Instruction Storage Interrupt) and DSI (Data Storage Interrupt) for 
instance are also interrupts. They happen every time there is a page 
fault, so they may happen pretty early.


Traps generated by WARN_ON() are also interrupts that may happen at any 
time.




Signed-off-by: Valentin Schneider 
---
  arch/powerpc/kernel/interrupt.c | 2 +-
  arch/powerpc/kernel/traps.c | 2 +-
  2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
index de10a2697258..c56c10b59be3 100644
--- a/arch/powerpc/kernel/interrupt.c
+++ b/arch/powerpc/kernel/interrupt.c
@@ -552,7 +552,7 @@ notrace unsigned long interrupt_exit_kernel_prepare(struct pt_regs *regs)
/* Returning to a kernel context with local irqs enabled. */
WARN_ON_ONCE(!(regs->msr & MSR_EE));
  again:
-   if (IS_ENABLED(CONFIG_PREEMPT)) {
+   if (is_preempt_full()) {


I think the cost of that additional test should be analysed. Maybe it's 
worth not doing the test at all and checking _TIF_NEED_RESCHED every time, 
unless that recurrent test is changed into a jump label as suggested in 
patch 2.




/* Return to preemptible kernel context */
 		if (unlikely(current_thread_info()->flags & _TIF_NEED_RESCHED)) {
if (preempt_count() == 0)
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index aac8c0412ff9..1cb31bbdc925 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -265,7 +265,7 @@ static int __die(const char *str, struct pt_regs *regs, long err)
printk("%s PAGE_SIZE=%luK%s%s%s%s%s%s %s\n",
   IS_ENABLED(CONFIG_CPU_LITTLE_ENDIAN) ? "LE" : "BE",
   PAGE_SIZE / 1024, get_mmu_str(),
-  IS_ENABLED(CONFIG_PREEMPT) ? " PREEMPT" : "",
+  is_preempt_full() ? " PREEMPT" : "",
   IS_ENABLED(CONFIG_SMP) ? " SMP" : "",
   IS_ENABLED(CONFIG_SMP) ? (" NR_CPUS=" __stringify(NR_CPUS)) : "",
   debug_pagealloc_enabled() ? " DEBUG_PAGEALLOC" : "",



Would it be interesting as well to know that we are indeed in a DYNAMIC 
PREEMPT context when dying ?


Christophe


[PATCH] powerpc/xive: Change IRQ domain to a tree domain

2021-11-16 Thread Cédric Le Goater
Commit 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains
exclusive") introduced an IRQ_DOMAIN_FLAG_NO_MAP flag to isolate the
'nomap' domains still in use under the powerpc arch. With this new
flag, the revmap_tree of the IRQ domain is not used anymore. This
change broke the support of shared LSIs [1] in the XIVE driver because
it was relying on a lookup in the revmap_tree to query previously
mapped interrupts. Linux now creates two distinct IRQ mappings on the
same HW IRQ which can lead to unexpected behavior in the drivers.

The XIVE IRQ domain is not a direct mapping domain and its HW IRQ
interrupt number space is rather large : 1M/socket on POWER9 and
POWER10, change the XIVE driver to use a 'tree' domain type instead.

[1] For instance, a linux KVM guest with virtio-rng and virtio-balloon
devices.

Cc: Marc Zyngier 
Cc: sta...@vger.kernel.org # v5.14+
Fixes: 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains exclusive")
Signed-off-by: Cédric Le Goater 
---

 Marc,

 The Fixes tag is there because the patch in question revealed that
 something was broken in XIVE. genirq is not in cause. However, I
 don't know for PS3 and Cell. May be less critical for now. 
 
 arch/powerpc/sysdev/xive/common.c | 3 +--
 arch/powerpc/sysdev/xive/Kconfig  | 1 -
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index fed6fd16c8f4..9d0f0fe25598 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1536,8 +1536,7 @@ static const struct irq_domain_ops xive_irq_domain_ops = {
 
 static void __init xive_init_host(struct device_node *np)
 {
-	xive_irq_domain = irq_domain_add_nomap(np, XIVE_MAX_IRQ,
-					       &xive_irq_domain_ops, NULL);
+	xive_irq_domain = irq_domain_add_tree(np, &xive_irq_domain_ops, NULL);
if (WARN_ON(xive_irq_domain == NULL))
return;
irq_set_default_host(xive_irq_domain);
diff --git a/arch/powerpc/sysdev/xive/Kconfig b/arch/powerpc/sysdev/xive/Kconfig
index 97796c6b63f0..785c292d104b 100644
--- a/arch/powerpc/sysdev/xive/Kconfig
+++ b/arch/powerpc/sysdev/xive/Kconfig
@@ -3,7 +3,6 @@ config PPC_XIVE
bool
select PPC_SMP_MUXED_IPI
select HARDIRQS_SW_RESEND
-   select IRQ_DOMAIN_NOMAP
 
 config PPC_XIVE_NATIVE
bool
-- 
2.31.1



Re: [PATCH v2 2/5] preempt/dynamic: Introduce preempt mode accessors

2021-11-16 Thread Christophe Leroy




Le 10/11/2021 à 21:24, Valentin Schneider a écrit :

CONFIG_PREEMPT{_NONE, _VOLUNTARY} designate either:
o The build-time preemption model when !PREEMPT_DYNAMIC
o The default boot-time preemption model when PREEMPT_DYNAMIC

IOW, using those on PREEMPT_DYNAMIC kernels is meaningless - the actual
model could have been set to something else by the "preempt=foo" cmdline
parameter.

Introduce a set of helpers to determine the actual preemption mode used by
the live kernel.

Suggested-by: Marco Elver 
Signed-off-by: Valentin Schneider 
---
  include/linux/sched.h | 16 
  kernel/sched/core.c   | 11 +++
  2 files changed, 27 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5f8db54226af..0640d5622496 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2073,6 +2073,22 @@ static inline void cond_resched_rcu(void)
  #endif
  }
  
+#ifdef CONFIG_PREEMPT_DYNAMIC

+
+extern bool is_preempt_none(void);
+extern bool is_preempt_voluntary(void);
+extern bool is_preempt_full(void);


Those are trivial tests supposed to be used in fast pathes. They should 
be static inlines in order to minimise the overhead.



+
+#else
+
+#define is_preempt_none() IS_ENABLED(CONFIG_PREEMPT_NONE)
+#define is_preempt_voluntary() IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)
+#define is_preempt_full() IS_ENABLED(CONFIG_PREEMPT)


Would be better to use static inlines here as well instead of macros.


+
+#endif
+
+#define is_preempt_rt() IS_ENABLED(CONFIG_PREEMPT_RT)
+
  /*
   * Does a critical section need to be broken due to another
   * task waiting?: (technically does not depend on CONFIG_PREEMPTION,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 97047aa7b6c2..9db7f77e53c3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6638,6 +6638,17 @@ static void __init preempt_dynamic_init(void)
}
  }
  
+#define PREEMPT_MODE_ACCESSOR(mode) \
+	bool is_preempt_##mode(void) \
+	{ \
+		WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \


Not sure using WARN_ON is a good idea here, as it may be called very 
early, see comment on powerpc patch.



+		return preempt_dynamic_mode == preempt_dynamic_##mode; \
+	}


I'm not sure that's worth a macro. You only have 3 accessors, 2 lines of 
code each. Just define all 3 in plain text.


CONFIG_PREEMPT_DYNAMIC is based on using strategies like static_calls in 
order to minimise the overhead. For those accessors you should use the 
same kind of approach and use things like jump_labels in order to not 
redo the test each time and minimise overhead as much as possible.
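
A userspace analogue of that suggestion (kernel static keys actually patch the instruction stream; this sketch merely resolves the mode once at init so hot paths test a plain cached boolean — all names here are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

enum preempt_mode { PREEMPT_NONE, PREEMPT_VOLUNTARY, PREEMPT_FULL };

/* Resolved once during init, e.g. from a "preempt=" boot parameter. */
static bool preempt_full_enabled;

static void preempt_mode_init(enum preempt_mode mode)
{
	/* With real static keys this would patch a branch instead. */
	preempt_full_enabled = (mode == PREEMPT_FULL);
}

/* Hot-path check: a single flag test, nothing re-derived per call. */
static inline bool is_preempt_full(void)
{
	return preempt_full_enabled;
}
```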



+
+PREEMPT_MODE_ACCESSOR(none)
+PREEMPT_MODE_ACCESSOR(voluntary)
+PREEMPT_MODE_ACCESSOR(full)
+
  #else /* !CONFIG_PREEMPT_DYNAMIC */
  
  static inline void preempt_dynamic_init(void) { }




[PATCH 4/7] KVM: mips: Use Makefile.kvm for common files

2021-11-16 Thread David Woodhouse
From: David Woodhouse 

Signed-off-by: David Woodhouse 
---
 arch/mips/kvm/Makefile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/mips/kvm/Makefile b/arch/mips/kvm/Makefile
index d3710959da55..21ff75bcdbc4 100644
--- a/arch/mips/kvm/Makefile
+++ b/arch/mips/kvm/Makefile
@@ -2,9 +2,10 @@
 # Makefile for KVM support for MIPS
 #
 
+include $(srctree)/virt/kvm/Makefile.kvm
+
 ccflags-y += -Ivirt/kvm -Iarch/mips/kvm
 
-kvm-y := $(addprefix ../../../virt/kvm/, kvm_main.o coalesced_mmio.o eventfd.o binary_stats.o)
 kvm-$(CONFIG_CPU_HAS_MSA) += msa.o
 
 kvm-y +=mips.o emulate.o entry.o \
-- 
2.31.1



[PATCH 5/7] KVM: RISC-V: Use Makefile.kvm for common files

2021-11-16 Thread David Woodhouse
From: David Woodhouse 

Signed-off-by: David Woodhouse 
---
 arch/riscv/kvm/Makefile | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/arch/riscv/kvm/Makefile b/arch/riscv/kvm/Makefile
index 30cdd1df0098..300590225348 100644
--- a/arch/riscv/kvm/Makefile
+++ b/arch/riscv/kvm/Makefile
@@ -5,14 +5,10 @@
 
 ccflags-y += -I $(srctree)/$(src)
 
-KVM := ../../../virt/kvm
+include $(srctree)/virt/kvm/Makefile.kvm
 
 obj-$(CONFIG_KVM) += kvm.o
 
-kvm-y += $(KVM)/kvm_main.o
-kvm-y += $(KVM)/coalesced_mmio.o
-kvm-y += $(KVM)/binary_stats.o
-kvm-y += $(KVM)/eventfd.o
 kvm-y += main.o
 kvm-y += vm.o
 kvm-y += vmid.o
-- 
2.31.1



[PATCH 3/7] KVM: s390: Use Makefile.kvm for common files

2021-11-16 Thread David Woodhouse
From: David Woodhouse 

Signed-off-by: David Woodhouse 
---
 arch/s390/kvm/Makefile | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/s390/kvm/Makefile b/arch/s390/kvm/Makefile
index b3aaadc60ead..e4f50453cf7f 100644
--- a/arch/s390/kvm/Makefile
+++ b/arch/s390/kvm/Makefile
@@ -3,13 +3,11 @@
 #
 # Copyright IBM Corp. 2008
 
-KVM := ../../../virt/kvm
-common-objs = $(KVM)/kvm_main.o $(KVM)/eventfd.o  $(KVM)/async_pf.o \
- $(KVM)/irqchip.o $(KVM)/vfio.o $(KVM)/binary_stats.o
+include $(srctree)/virt/kvm/Makefile.kvm
 
 ccflags-y := -Ivirt/kvm -Iarch/s390/kvm
 
-kvm-objs := $(common-objs) kvm-s390.o intercept.o interrupt.o priv.o sigp.o
+kvm-objs := kvm-s390.o intercept.o interrupt.o priv.o sigp.o
 kvm-objs += diag.o gaccess.o guestdbg.o vsie.o pv.o
 
 obj-$(CONFIG_KVM) += kvm.o
-- 
2.31.1



[PATCH 7/7] KVM: arm64: Use Makefile.kvm for common files

2021-11-16 Thread David Woodhouse
From: David Woodhouse 

Signed-off-by: David Woodhouse 
---
 arch/arm64/kvm/Makefile | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 989bb5dad2c8..04a53f71a6b6 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -5,14 +5,12 @@
 
 ccflags-y += -I $(srctree)/$(src)
 
-KVM=../../../virt/kvm
+include $(srctree)/virt/kvm/Makefile.kvm
 
 obj-$(CONFIG_KVM) += kvm.o
 obj-$(CONFIG_KVM) += hyp/
 
-kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
-$(KVM)/vfio.o $(KVM)/irqchip.o $(KVM)/binary_stats.o \
-arm.o mmu.o mmio.o psci.o perf.o hypercalls.o pvtime.o \
+kvm-y += arm.o mmu.o mmio.o psci.o perf.o hypercalls.o pvtime.o \
 inject_fault.o va_layout.o handle_exit.o \
 guest.o debug.o reset.o sys_regs.o \
 vgic-sys-reg-v3.o fpsimd.o pmu.o \
-- 
2.31.1



[PATCH 2/7] KVM: Add Makefile.kvm for common files, use it for x86

2021-11-16 Thread David Woodhouse
From: David Woodhouse 

Splitting kvm_main.c out into smaller and better-organized files is
slightly non-trivial when it involves editing a bunch of per-arch
KVM makefiles. Provide virt/kvm/Makefile.kvm for them to include.

Signed-off-by: David Woodhouse 
---
 arch/x86/kvm/Makefile |  7 +--
 virt/kvm/Makefile.kvm | 13 +
 2 files changed, 14 insertions(+), 6 deletions(-)
 create mode 100644 virt/kvm/Makefile.kvm

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 75dfd27b6e8a..30f244b64523 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -7,12 +7,7 @@ ifeq ($(CONFIG_FRAME_POINTER),y)
 OBJECT_FILES_NON_STANDARD_vmenter.o := y
 endif
 
-KVM := ../../../virt/kvm
-
-kvm-y  += $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
-	$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
-   $(KVM)/dirty_ring.o $(KVM)/binary_stats.o
-kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
+include $(srctree)/virt/kvm/Makefile.kvm
 
 kvm-y  += x86.o emulate.o i8259.o irq.o lapic.o \
   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
diff --git a/virt/kvm/Makefile.kvm b/virt/kvm/Makefile.kvm
new file mode 100644
index ..ee9c310f3601
--- /dev/null
+++ b/virt/kvm/Makefile.kvm
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Makefile for Kernel-based Virtual Machine module
+#
+
+KVM ?= ../../../virt/kvm
+
+kvm-y := $(KVM)/kvm_main.o $(KVM)/eventfd.o $(KVM)/binary_stats.o
+kvm-$(CONFIG_KVM_VFIO) += $(KVM)/vfio.o
+kvm-$(CONFIG_KVM_MMIO) += $(KVM)/coalesced_mmio.o
+kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
+kvm-$(CONFIG_HAVE_KVM_IRQCHIP) += $(KVM)/irqchip.o
+kvm-$(CONFIG_HAVE_KVM_DIRTY_RING) += $(KVM)/dirty_ring.o
-- 
2.31.1



[PATCH 6/7] KVM: powerpc: Use Makefile.kvm for common files

2021-11-16 Thread David Woodhouse
From: David Woodhouse 

It's all fairly baroque but, in the end, I don't think there's any reason
for $(KVM)/irqchip.o to have been handled differently, as all the objects
end up in $(kvm-y) anyway, regardless of whether they get there
via $(common-objs-y) or the CPU-specific object lists.

The generic Makefile.kvm uses HAVE_KVM_IRQCHIP for irqchip.o instead of
HAVE_KVM_IRQ_ROUTING. That change is fine (and arguably correct) because
they are both set together for KVM_MPIC, or neither is set.

Signed-off-by: David Woodhouse 
---
 arch/powerpc/kvm/Makefile | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index 583c14ef596e..245f59118413 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -4,11 +4,8 @@
 #
 
 ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
-KVM := ../../../virt/kvm
 
-common-objs-y = $(KVM)/kvm_main.o $(KVM)/eventfd.o $(KVM)/binary_stats.o
-common-objs-$(CONFIG_KVM_VFIO) += $(KVM)/vfio.o
-common-objs-$(CONFIG_KVM_MMIO) += $(KVM)/coalesced_mmio.o
+include $(srctree)/virt/kvm/Makefile.kvm
 
 common-objs-y += powerpc.o emulate_loadstore.o
 obj-$(CONFIG_KVM_EXIT_TIMING) += timing.o
@@ -125,7 +122,6 @@ kvm-book3s_32-objs := \
 kvm-objs-$(CONFIG_KVM_BOOK3S_32) := $(kvm-book3s_32-objs)
 
 kvm-objs-$(CONFIG_KVM_MPIC) += mpic.o
-kvm-objs-$(CONFIG_HAVE_KVM_IRQ_ROUTING) += $(KVM)/irqchip.o
 
 kvm-objs := $(kvm-objs-m) $(kvm-objs-y)
 
-- 
2.31.1



[PATCH 1/7] KVM: Introduce CONFIG_HAVE_KVM_DIRTY_RING

2021-11-16 Thread David Woodhouse
From: David Woodhouse 

I'd like to make the build include dirty_ring.c based on whether the
arch wants it or not. That's a whole lot simpler if there's a config
symbol instead of doing it implicitly on KVM_DIRTY_LOG_PAGE_OFFSET
being set to something non-zero.
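
The pattern — compile the real implementation only when the arch selects the config symbol, and fall back to no-op stubs otherwise — can be sketched like this (a hypothetical feature macro, not the actual dirty-ring API):

```c
#include <assert.h>

/* Pretend the arch Kconfig selected the feature; drop this to get stubs. */
#define CONFIG_HAVE_FEATURE_X 1

#ifdef CONFIG_HAVE_FEATURE_X
/* Real implementation, built only when the arch opts in. */
static int feature_x_entries(void) { return 4096; }
#else
/* No-op stub so callers need no #ifdefs of their own. */
static int feature_x_entries(void) { return 0; }
#endif
```

Callers stay identical either way; only the object list (and here, the stub) changes with the symbol.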

Signed-off-by: David Woodhouse 
---
 arch/x86/kvm/Kconfig   | 1 +
 include/linux/kvm_dirty_ring.h | 8 
 virt/kvm/Kconfig   | 3 +++
 virt/kvm/kvm_main.c| 4 ++--
 4 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 619186138176..d7fa0a42ac25 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -27,6 +27,7 @@ config KVM
select MMU_NOTIFIER
select HAVE_KVM_IRQCHIP
select HAVE_KVM_IRQFD
+   select HAVE_KVM_DIRTY_RING
select IRQ_BYPASS_MANAGER
select HAVE_KVM_IRQ_BYPASS
select HAVE_KVM_IRQ_ROUTING
diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
index 120e5e90fa1d..4da8d4a4140b 100644
--- a/include/linux/kvm_dirty_ring.h
+++ b/include/linux/kvm_dirty_ring.h
@@ -27,9 +27,9 @@ struct kvm_dirty_ring {
int index;
 };
 
-#if (KVM_DIRTY_LOG_PAGE_OFFSET == 0)
+#ifndef CONFIG_HAVE_KVM_DIRTY_RING
 /*
- * If KVM_DIRTY_LOG_PAGE_OFFSET not defined, kvm_dirty_ring.o should
+ * If CONFIG_HAVE_KVM_DIRTY_RING not defined, kvm_dirty_ring.o should
  * not be included as well, so define these nop functions for the arch.
  */
 static inline u32 kvm_dirty_ring_get_rsvd_entries(void)
@@ -74,7 +74,7 @@ static inline bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring)
return true;
 }
 
-#else /* KVM_DIRTY_LOG_PAGE_OFFSET == 0 */
+#else /* CONFIG_HAVE_KVM_DIRTY_RING */
 
 u32 kvm_dirty_ring_get_rsvd_entries(void);
 int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size);
@@ -98,6 +98,6 @@ struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset);
 void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
 bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring);
 
-#endif /* KVM_DIRTY_LOG_PAGE_OFFSET == 0 */
+#endif /* CONFIG_HAVE_KVM_DIRTY_RING */
 
 #endif /* KVM_DIRTY_RING_H */
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 62b39149b8c8..97cf5413ac25 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -13,6 +13,9 @@ config HAVE_KVM_IRQFD
 config HAVE_KVM_IRQ_ROUTING
bool
 
+config HAVE_KVM_DIRTY_RING
+   bool
+
 config HAVE_KVM_EVENTFD
bool
select EVENTFD
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9646bb9112c1..356d636e037d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3411,7 +3411,7 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
 
 static bool kvm_page_in_dirty_ring(struct kvm *kvm, unsigned long pgoff)
 {
-#if KVM_DIRTY_LOG_PAGE_OFFSET > 0
+#ifdef CONFIG_HAVE_KVM_DIRTY_RING
return (pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
(pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
 kvm->dirty_ring_size / PAGE_SIZE);
@@ -4114,7 +4114,7 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
case KVM_CAP_NR_MEMSLOTS:
return KVM_USER_MEM_SLOTS;
case KVM_CAP_DIRTY_LOG_RING:
-#if KVM_DIRTY_LOG_PAGE_OFFSET > 0
+#ifdef CONFIG_HAVE_KVM_DIRTY_RING
 		return KVM_DIRTY_RING_MAX_ENTRIES * sizeof(struct kvm_dirty_gfn);
 #else
return 0;
-- 
2.31.1



[PATCH 0/7] KVM: Add Makefile.kvm for common files

2021-11-16 Thread David Woodhouse
On Mon, 2021-11-15 at 20:26 +0100, Paolo Bonzini wrote:
> > > Also, for the small requests: since you are at it, can you add the code
> > > in a new file under virt/kvm/?
> >
> > Hm... only if I can make hva_to_pfn() and probably a handful of other
> > things non-static?
> 
> Yes, I think sooner or later we also want all pfn stuff in one file 
> (together with MMU notifiers) and all hva stuff in another; so for now 
> you can create virt/kvm/hva_to_pfn.h, or virt/kvm/mm.h, or whatever 
> color of the bikeshed you prefer.


OK... let's start with this.

David Woodhouse (7):
  KVM: Introduce CONFIG_HAVE_KVM_DIRTY_RING
  KVM: Add Makefile.kvm for common files, use it for x86
  KVM: s390: Use Makefile.kvm for common files
  KVM: mips: Use Makefile.kvm for common files
  KVM: RISC-V: Use Makefile.kvm for common files
  KVM: powerpc: Use Makefile.kvm for common files
  KVM: arm64: Use Makefile.kvm for common files

 arch/arm64/kvm/Makefile|  6 ++
 arch/mips/kvm/Makefile |  3 ++-
 arch/powerpc/kvm/Makefile  |  6 +-
 arch/riscv/kvm/Makefile|  6 +-
 arch/s390/kvm/Makefile |  6 ++
 arch/x86/kvm/Kconfig   |  1 +
 arch/x86/kvm/Makefile  |  7 +--
 include/linux/kvm_dirty_ring.h |  8 
 virt/kvm/Kconfig   |  3 +++
 virt/kvm/Makefile.kvm  | 13 +
 virt/kvm/kvm_main.c|  4 ++--
 11 files changed, 32 insertions(+), 31 deletions(-)




Re: [PATCH 16/39] irqdomain: Make normal and nomap irqdomains exclusive

2021-11-16 Thread Cédric Le Goater

Hello Marc,


This patch is breaking the POWER9/POWER10 XIVE driver (these are not
old PPC systems :) on machines sharing the same LSI HW IRQ. For instance,
a linux KVM guest with a virtio-rng and a virtio-balloon device. In that
case, Linux creates two distinct IRQ mappings which can lead to some
unexpected behavior.


Either the irq domain translates, or it doesn't. If the driver creates
a nomap domain, and yet expects some sort of translation to happen,
then the driver is fundamentally broken. And even without that: how do
you end-up with a single HW interrupt having two mappings?


A fix to go forward would be to change the XIVE IRQ domain to use a
'Tree' domain for reverse mapping and not the 'No Map' domain mapping.
I will keep you updated for XIVE.


I bet there is a bit more to it. From what you are saying above,
something rather ungodly is happening in the XIVE code.


It's making progress.

This change in irq_find_mapping() is what 'breaks' XIVE :

  +	if (irq_domain_is_nomap(domain)) {
  +		if (hwirq < domain->revmap_size) {
  +			data = irq_domain_get_irq_data(domain, hwirq);
  +			if (data && data->hwirq == hwirq)
  +				return hwirq;
  +		}
  +
  +		return 0;


With the introduction of IRQ_DOMAIN_FLAG_NO_MAP, the revmap_tree lookup
is skipped and the previously mapped IRQ is not found. XIVE was relying
on a side effect of irq_domain_set_mapping() which is not true anymore.

I guess the easiest fix for 5.14 and 5.15 (in which was introduced MSI
domains) is to change the XIVE IRQ domain to a domain tree. Since the HW
can handle 1M interrupts, this looks like a better choice for the driver.
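
The trade-off can be sketched in userspace terms (illustrative code, not the irqdomain API): a linear revmap needs an array sized for the whole hwirq space (~1M entries per socket here), while a tree/hash reverse map only stores the handful of mappings actually created.

```c
#include <assert.h>
#include <stdlib.h>

/* Tiny chained hash as a stand-in for the kernel's tree-based revmap. */
#define NBUCKETS 64

struct revmap_entry {
	unsigned long hwirq;
	unsigned int virq;
	struct revmap_entry *next;
};

static struct revmap_entry *buckets[NBUCKETS];

static void revmap_insert(unsigned long hwirq, unsigned int virq)
{
	struct revmap_entry *e = malloc(sizeof(*e));

	e->hwirq = hwirq;
	e->virq = virq;
	e->next = buckets[hwirq % NBUCKETS];
	buckets[hwirq % NBUCKETS] = e;
}

static unsigned int revmap_find(unsigned long hwirq)
{
	for (struct revmap_entry *e = buckets[hwirq % NBUCKETS]; e; e = e->next)
		if (e->hwirq == hwirq)
			return e->virq;
	return 0; /* no mapping */
}
```

The lookup cost is per-mapping, not per-hwirq-space, which is exactly what a sparse 1M space wants.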

Thanks,

C.


Re: Build regressions/improvements in v5.16-rc1

2021-11-16 Thread Thomas Bogendoerfer
On Mon, Nov 15, 2021 at 05:12:50PM +0100, Geert Uytterhoeven wrote:
> >   + error: modpost: "mips_cm_is64" [drivers/pci/controller/pcie-mt7621.ko] 
> > undefined!:  => N/A
> >   + error: modpost: "mips_cm_lock_other" 
> > [drivers/pci/controller/pcie-mt7621.ko] undefined!:  => N/A
> >   + error: modpost: "mips_cm_unlock_other" 
> > [drivers/pci/controller/pcie-mt7621.ko] undefined!:  => N/A
> >   + error: modpost: "mips_cpc_base" [drivers/pci/controller/pcie-mt7621.ko] 
> > undefined!:  => N/A
> >   + error: modpost: "mips_gcr_base" [drivers/pci/controller/pcie-mt7621.ko] 
> > undefined!:  => N/A
> 
> mips-allmodconfig

there is a patchset fixing this

https://lore.kernel.org/all/2025070809.15529-1-sergio.paracuel...@gmail.com/

> > 3 warning regressions:
> >   + : warning: #warning syscall futex_waitv not implemented [-Wcpp]: 
> >  => 1559:2
> 
> powerpc, m68k, mips, s390, parisc (and probably more)

I've queued a patch to fix this for mips.

Thomas.

-- 
Crap can work. Given enough thrust pigs will fly, but it's not necessarily a
good idea.[ RFC1925, 2.3 ]


Re: [PATCH 0/3] KEXEC_SIG with appended signature

2021-11-16 Thread Michal Suchánek
On Mon, Nov 15, 2021 at 06:53:53PM -0500, Nayna wrote:
> 
> On 11/12/21 03:30, Michal Suchánek wrote:
> > Hello,
> > 
> > On Thu, Nov 11, 2021 at 05:26:41PM -0500, Nayna wrote:
> > > On 11/8/21 07:05, Michal Suchánek wrote:
> > > > Hello,
> > > > 

> > > > The other part is that distributions apply 'lockdown' patches that change
> > > > the security policy depending on secure boot status which were rejected
> > > > by upstream which only hook into the _SIG options, and not into the IMA_
> > > > options. Of course, I expect this to change when the IMA options are
> > > > universally available across architectures and the support picked up by
> > > > distributions.
> > > > 
> > > > Which brings the third point: IMA features vary across architectures,
> > > > and KEXEC_SIG is more common than IMA_KEXEC.
> > > > 
> > > > config/arm64/default:CONFIG_HAVE_IMA_KEXEC=y
> > > > config/ppc64le/default:CONFIG_HAVE_IMA_KEXEC=y
> > > > 
> > > > config/arm64/default:CONFIG_KEXEC_SIG=y
> > > > config/s390x/default:CONFIG_KEXEC_SIG=y
> > > > config/x86_64/default:CONFIG_KEXEC_SIG=y
> > > > 
> > > > KEXEC_SIG makes it much easier to get uniform features across
> > > > architectures.
> > > Architectures use KEXEC_SIG vs IMA_KEXEC based on their requirement.
> > > IMA_KEXEC is for the kernel images signed using sign-file (appended
> > > signatures, not PECOFF), provides measurement along with verification, and
> > That's certainly not the case. S390 uses appended signatures with
> > KEXEC_SIG, arm64 uses PECOFF with both KEXEC_SIG and IMA_KEXEC.
> 
> Yes, S390 uses appended signature, but they also do not support
> measurements.
> 
> On the other hand for arm64/x86, PECOFF works only with KEXEC_SIG. Look at
> the KEXEC_IMAGE_VERIFY_SIG config dependencies in arch/arm64/Kconfig and
> KEXEC_BZIMAGE_VERIFY_SIG config dependencies in arch/x86/Kconfig. Now, if
> KEXEC_SIG is not enabled, then IMA appraisal policies are enforced if secure
> boot is enabled, refer to security/integrity/ima_efi.c . IMA would fail
> verification if kernel is not signed with module sig appended signatures or
> signature verification fails.
> 
> In short, IMA is used to enforce the existence of a policy if secure boot is
> enabled. If they don't support module sig appended signatures, by definition
> it fails. Thus PECOFF doesn't work with both KEXEC_SIG and IMA_KEXEC, but
> only with KEXEC_SIG.

Then IMA_KEXEC is a no-go. It is not supported on all architectures and
in principle it cannot be supported because it does not support PECOFF,
which is needed to boot the kernel on EFI platforms. To get feature
parity across architectures, KEXEC_SIG is required.

> > 
> > > is tied to secureboot state of the system at boot time.
> > In distributions it's also the case with KEXEC_SIG, it's only upstream
> > where this is different. I don't know why Linux upstream has rejected
> > this support for KEXEC_SIG.
> > 
> > Anyway, sounds like the difference is that IMA provides measurement, but
> > if you don't use it, it does not make any difference except more complex
> > code.
> I am unsure what do you mean by "complex code" here. Can you please
> elaborate ? IMA policies support for secureboot already exists and can be
> used as it is without adding any extra work as in
> arch/powerpc/kernel/ima_arch.c.

The code exists but using it to replace KEXEC_SIG also requires
understanding the code and the implications of using it. At a glance the
IMA codebase is much bigger and more convoluted compared to KEXEC_SIG
and MODULE_SIG.

Thanks

Michal


Re: [PATCH v2 3/3] soc: fsl: Replace kernel.h with the necessary inclusions

2021-11-16 Thread Andy Shevchenko
On Mon, Nov 15, 2021 at 10:24:36PM +, Leo Li wrote:
> > From: Andy Shevchenko 
> > Sent: Monday, November 15, 2021 5:30 AM
> > On Wed, Nov 10, 2021 at 12:59:52PM +0200, Andy Shevchenko wrote:

...

> > > v2: updated Cc list based on previous changes to MAINTAINERS
> > 
> > Any comments on this, please?
> > 
> > I really want to decrease amount of kernel.h usage in the common headers.
> > So others won't copy'n'paste bad example.
> 
> There seems to be no problem with the patch although I didn't get time to 
> really compile with it applied.
> 
> Will pick them up later after build test.

Thank you!

Note, it has two fixes against MAINTAINERS which may be sent, I believe,
sooner than later to Linus.

-- 
With Best Regards,
Andy Shevchenko




Re: [RFC PATCH 0/3] Use pageblock_order for cma and alloc_contig_range alignment.

2021-11-16 Thread David Hildenbrand
On 15.11.21 20:37, Zi Yan wrote:
> From: Zi Yan 
> 
> Hi David,

Hi,

thanks for looking into this.

> 
> You suggested to make alloc_contig_range() deal with pageblock_order instead of
> MAX_ORDER - 1 and get rid of MAX_ORDER - 1 dependency in virtio_mem[1]. This
> patchset is my attempt to achieve that. Please take a look and let me know if
> I am doing it correctly or not.
> 
> From what my understanding, cma required alignment of
> max(MAX_ORDER - 1, pageblock_order), because when MIGRATE_CMA was introduced,
> __free_one_page() does not prevent merging two different pageblocks, when
> MAX_ORDER - 1 > pageblock_order. But current __free_one_page() implementation
> does prevent that. It should be OK to just align cma to pageblock_order.
> alloc_contig_range() relies on MIGRATE_CMA to get free pages, so it can use
> pageblock_order as alignment too.

I wonder if that's sufficient. Especially the outer_start logic in
alloc_contig_range() might be problematic. There are some ugly corner
cases with free pages/allocations spanning multiple pageblocks and we
only isolated a single pageblock.


Regarding CMA, we have to keep the following cases working:

a) Different pageblock types (MIGRATE_CMA and !MIGRATE_CMA) in a MAX_ORDER - 1 page:
   [      MAX_ORDER - 1      ]
   [ pageblock 0 | pageblock 1]

Assume either pageblock 0 is MIGRATE_CMA or pageblock 1 is MIGRATE_CMA,
but not both. We have to make sure alloc_contig_range() keeps working
correctly. This should be the case even with your change, as we won't
merging pages accross differing migratetypes.

b) Migrating/freeing a MAX_ORDER - 1 page while partially isolated:
   [      MAX_ORDER - 1      ]
   [ pageblock 0 | pageblock 1]

Assume both are MIGRATE_CMA. Assume we want to either allocate from
pageblock 0 or pageblock 1. Especially, assume we want to allocate from
pageblock 1. While we would isolate pageblock 1, we wouldn't isolate
pageblock 0.

What happens if we either have a free page spanning the MAX_ORDER - 1
range already OR if we have to migrate a MAX_ORDER - 1 page, resulting
in a free MAX_ORDER - 1 page of which only the second pageblock is
isolated? We would end up essentially freeing a page that has mixed
pageblocks, essentially placing it in !MIGRATE_ISOLATE free lists ... I
might be wrong but I have the feeling that this would be problematic.

c) Concurrent allocations:
[      MAX_ORDER - 1      ]
[ pageblock 0 | pageblock 1]

Assume b) but we have two concurrent CMA allocations to pageblock 0 and
pageblock 1, which would now be possible as start_isolate_page_range()
isolate would succeed on both.


Regarding virtio-mem, we care about the following cases:

a) Allocating parts from completely movable MAX_ORDER - 1 page:
   [      MAX_ORDER - 1      ]
   [ pageblock 0 | pageblock 1]

Assume pageblock 0 and pageblock 1 are either free or contain only
movable pages. Assume we allocated pageblock 0. We have to make sure we
can allocate pageblock 1. The other way around, assume we allocated
pageblock 1, we have to make sure we can allocate pageblock 0.

Free pages spanning both pageblocks might be problematic.

b) Allocate parts of partially movable MAX_ORDER - 1 page:
   [      MAX_ORDER - 1      ]
   [ pageblock 0 | pageblock 1]

Assume pageblock 0 contains unmovable data but pageblock 1 does not: we have
to make sure we can allocate pageblock 1. Similarly, assume pageblock 1
contains unmovable data but pageblock 0 does not: we have to make sure we can
allocate pageblock 0. has_unmovable_pages() might allow for that.

But, we want to fail early in case we want to allocate a single
pageblock but it contains unmovable data. This could be either directly
or indirectly.

If we have an unmovable (compound) MAX_ORDER - 1 page and we'd try isolating
pageblock 1, has_unmovable_pages() would always return "false" because
we'd simply be skipping over any tail pages, and not detect the
un-movability.
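A minimal model of the tail-page skipping issue: scanning pageblock 1 of a compound allocation only ever sees tail pages, so the scan never reports the un-movability recorded at the head. scan_has_unmovable() is a drastically simplified stand-in for the kernel's has_unmovable_pages(), and struct page and the layout are invented here:

```c
/* Toy model of the tail-skipping problem: NOT the real
 * has_unmovable_pages(); struct page and the layout are invented. */
#include <assert.h>
#include <stdbool.h>

#define PB_PAGES 8	/* pages per pageblock in this model */

struct page {
	bool is_tail;	/* tail page of a compound allocation */
	bool unmovable;	/* only meaningful on the head page */
};

/* One compound, unmovable allocation spanning pageblocks 0 and 1:
 * head at pfn 0, tails everywhere else. */
static struct page pages[2 * PB_PAGES];

/* Scan a single pageblock; tail pages are skipped, so only a head
 * page inside the scanned range can report un-movability. */
static bool scan_has_unmovable(unsigned long start_pfn)
{
	for (unsigned long pfn = start_pfn; pfn < start_pfn + PB_PAGES; pfn++) {
		if (pages[pfn].is_tail)
			continue;	/* the skip that hides the head */
		if (pages[pfn].unmovable)
			return true;
	}
	return false;
}
```

Scanning pageblock 0 finds the unmovable head; scanning pageblock 1 in isolation reports the range as movable.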

c) Migrating/freeing a MAX_ORDER - 1 page while partially isolated:

Same concern as for CMA b)


So the biggest concern I have is dealing with migrating/freeing >
pageblock_order pages while only having isolated a single pageblock.

> 
> In terms of virtio_mem, if I understand correctly, it relies on
> alloc_contig_range() to obtain contiguous free pages and offlines them
> to reduce guest memory size. As a result of the alloc_contig_range()
> alignment change, virtio_mem should be able to just align PFNs to
> pageblock_order.

For virtio-mem it will most probably be desirable to first try
allocating the MAX_ORDER - 1 range if possible and then fall back to
pageblock_order. But that's an additional change on top in the
virtio-mem code.
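Such a fallback policy could be sketched roughly as follows; try_alloc_contig() is merely a stand-in for alloc_contig_range(), and max_free fakes the zone state, so none of these names reflect the real virtio-mem interface:

```c
/* Sketch of a try-big-then-fall-back allocation policy. */
#include <assert.h>
#include <stdbool.h>

#define MAX_ORDER_NR_PAGES	1024
#define PAGEBLOCK_NR_PAGES	512

static unsigned long max_free;	/* largest range the fake zone can satisfy */

/* Stand-in for alloc_contig_range(): succeed iff the fake zone has room. */
static bool try_alloc_contig(unsigned long nr_pages)
{
	return nr_pages <= max_free;
}

/* Prefer the full MAX_ORDER - 1 range; fall back to one pageblock. */
static unsigned long alloc_with_fallback(void)
{
	if (try_alloc_contig(MAX_ORDER_NR_PAGES))
		return MAX_ORDER_NR_PAGES;
	if (try_alloc_contig(PAGEBLOCK_NR_PAGES))
		return PAGEBLOCK_NR_PAGES;
	return 0;
}
```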



My take to teach alloc_contig_range() to properly handle would be the
following:

a) Convert MIGRATE_ISOLATE into a separate pageblock flag

We would want to convert MIGRATE_ISOLATE into a separate pageblock
flag, such that when we isolate a pageblock we preserve the original
migratetype. While start_isolate_page_range()
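The idea of a separate isolation flag can be sketched with a toy pageblock flag word: isolation lives in its own bit, so setting and clearing it leaves the stored migratetype untouched (today, writing MIGRATE_ISOLATE overwrites it). The bit layout and helper names below are invented for illustration only:

```c
/* Toy pageblock flag word: migratetype in the low bits, isolation in
 * a separate bit. Bit layout and helpers are invented, not kernel API. */
#include <assert.h>

enum mt { MIGRATE_MOVABLE = 0, MIGRATE_CMA = 1 };

#define MT_MASK		0x3
#define PB_ISOLATED	0x4	/* isolating no longer destroys the type */

static unsigned int pb_set_isolated(unsigned int flags)
{
	return flags | PB_ISOLATED;
}

static unsigned int pb_clear_isolated(unsigned int flags)
{
	return flags & ~PB_ISOLATED;
}

static enum mt pb_migratetype(unsigned int flags)
{
	return (enum mt)(flags & MT_MASK);
}
```

With this, un-isolating a pageblock can simply clear PB_ISOLATED and recover the original migratetype instead of guessing it.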

Re: [PATCH 16/39] irqdomain: Make normal and nomap irqdomains exclusive

2021-11-16 Thread Marc Zyngier
On Mon, 15 Nov 2021 19:05:17 +,
Cédric Le Goater  wrote:
> 
> Hello Mark,

s/k/c/

> 
> On 5/20/21 18:37, Marc Zyngier wrote:
> > Direct mappings are completely exclusive of normal mappings, meaning
> > that we can refactor the code slightly so that we can get rid of
> > the revmap_direct_max_irq field and use the revmap_size field
> > instead, reducing the size of the irqdomain structure.
> > 
> > Signed-off-by: Marc Zyngier 
> 
> 
> This patch is breaking the POWER9/POWER10 XIVE driver (these are not
> old PPC systems :) on machines sharing the same LSI HW IRQ. For instance,
> a linux KVM guest with a virtio-rng and a virtio-balloon device. In that
> case, Linux creates two distinct IRQ mappings which can lead to some
> unexpected behavior.

Either the irq domain translates, or it doesn't. If the driver creates
a nomap domain, and yet expects some sort of translation to happen,
then the driver is fundamentally broken. And even without that: how do
you end up with a single HW interrupt having two mappings?

> A fix to go forward would be to change the XIVE IRQ domain to use a
> 'Tree' domain for reverse mapping and not the 'No Map' domain mapping.
> I will keep you updated for XIVE.

I bet there is a bit more to it. From what you are saying above,
something rather ungodly is happening in the XIVE code.

M.

-- 
Without deviation from the norm, progress is not possible.