Re: [Qemu-ppc] pseries on qemu-system-ppc64le crashes in doorbell_core_ipi()

2019-12-19 Thread Michael Ellerman
"Jason A. Donenfeld"  writes:
> Hi folks,
>
> I'm actually still experiencing this sporadically in the WireGuard test 
> suite, which you can see being run on https://build.wireguard.com/ . 

Fancy dashboard you got there :)

> About 50% of the time the powerpc64 build will fail at a place like this:
>
> [   65.147823] Oops: Exception in kernel mode, sig: 4 [#1]
> [   65.149198] LE PAGE_SIZE=4K MMU=Hash PREEMPT SMP NR_CPUS=4 NUMA pSeries
> [   65.149595] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc1+ #1
> [   65.149745] NIP:  c000 LR: c007eda0 CTR: 
> c007ed80
> [   65.149934] REGS: c0a47970 TRAP: 0700   Not tainted  (5.5.0-rc1+)
> [   65.150032] MSR:  8288b033  > CR: 
> 48008288  XER: 
> [   65.150352] CFAR: c00332bc IRQMASK: 1
> [   65.150352] GPR00: c0036508 c0a47c00 c0a4c100 
> 0001
> [   65.150352] GPR04: c0a50998  c0a50908 
> 0f509000
> [   65.150352] GPR08: 2800   
> cff24f00
> [   65.150352] GPR12: c007ed80 c0ad9000  
> 
> [   65.150352] GPR16: 008c9190 008c94a8 008c92f8 
> 008c98b0
> [   65.150352] GPR20: 008f2f88 fffd 0014 
> 00e6c100
> [   65.150352] GPR24: 00e6c100 0001  
> c0a50998
> [   65.150352] GPR28: c0a9e280 c0a50aa4 0002 
> 
> [   65.151591] NIP [c000] doorbell_try_core_ipi+0xd0/0xf0
> [   65.151750] LR [c007eda0] smp_pseries_cause_ipi+0x20/0x70
> [   65.151913] Call Trace:
> [   65.152109] [c0a47c00] [c00c7c9c] 
> _nohz_idle_balance+0xbc/0x300 (unreliable)
> [   65.152370] [c0a47c30] [c0036508] 
> smp_send_reschedule+0x98/0xb0
> [   65.152711] [c0a47c50] [c00c1634] kick_ilb+0x114/0x140
> [   65.152962] [c0a47ca0] [c00c86d8] 
> newidle_balance+0x4e8/0x500
> [   65.153213] [c0a47d20] [c00c8788] 
> pick_next_task_fair+0x48/0x3a0
> [   65.153424] [c0a47d80] [c0466620] __schedule+0xf0/0x430
> [   65.153612] [c0a47de0] [c0466b04] schedule_idle+0x34/0x70
> [   65.153786] [c0a47e10] [c00c0bc8] do_idle+0x1a8/0x220
> [   65.154121] [c0a47e70] [c00c0e94] 
> cpu_startup_entry+0x34/0x40
> [   65.154313] [c0a47ea0] [c000ef1c] rest_init+0x10c/0x124
> [   65.154414] [c0a47ee0] [c054] start_kernel+0x568/0x594
> [   65.154585] [c0a47f90] [c000a7cc] 
> start_here_common+0x1c/0x330
> [   65.154854] Instruction dump:
> [   65.155191] 38210030 e8010010 7c0803a6 4e800020 3d220004 39295228 8129 
> 3929
> [   65.155498] 7d284038 7c0004ac 5508017e 65082800 <7c00411c> e94d0178 
> 812a 3929
  ^
Again the faulting instruction there is "msgsndp r8"
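(Aside: that decode is easy to check by hand. A small stand-alone C snippet,
assuming the usual Power ISA X-form field layout; it prints "31 142 r8", and
primary opcode 31 with extended opcode 142 is msgsndp:

#include <stdio.h>

int main(void)
{
	unsigned int inst = 0x7c00411c;          /* faulting word from the dump */
	unsigned int po   = inst >> 26;          /* primary opcode */
	unsigned int xo   = (inst >> 1) & 0x3ff; /* X-form extended opcode */
	unsigned int rb   = (inst >> 11) & 0x1f; /* RB field: message register */

	printf("%u %u r%u\n", po, xo, rb);       /* -> 31 142 r8, i.e. msgsndp r8 */
	return 0;
}
)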

> [   65.156155] ---[ end trace 6180d12e268ffdaf ]---
> [   65.185452]
> [   66.187490] Kernel panic - not syncing: Fatal exception
>
> This is with "qemu-system-ppc64 -smp 4 -machine pseries" on QEMU 4.0.0.
>
> I'm not totally sure what's going on here. I'm emulating a pseries, and 
> using that with qemu's pseries model, and I see that selecting the 
> pseries forces the selection of 'config PPC_DOORBELL' (twice in the same 
> section, actually).

Noted.

> Then inside the kernel there appears to be some runtime CPU check for
> doorbell support.

Not really. The kernel looks at the CPU revision (PVR) and decides that
it has doorbell support.
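(A minimal sketch of the kind of gate involved, assuming the usual
arch/powerpc mechanics: CPU_FTR_DBELL is a feature bit set from the
cpu_specs[] entry matched against the PVR at boot, so the decision never
consults the hypervisor or the emulator:

/* Sketch only, not the exact call site; cpu_has_feature() and
 * CPU_FTR_DBELL are the real names from arch/powerpc. */
#include <asm/cputable.h>          /* CPU_FTR_DBELL */
#include <asm/cpu_has_feature.h>   /* cpu_has_feature() */

static bool use_doorbell_for_ipi(void)
{
	/* true on any CPU whose PVR matched a POWER8/POWER9-class entry */
	return cpu_has_feature(CPU_FTR_DBELL);
}
)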

> Is this a case in which QEMU is advertising doorbell support that TCG
> doesn't have? Or is something else happening here?

It's a gap in the emulation I guess. qemu doesn't emulate msgsndp, but
it really should because that's a supported instruction since Power8.

I suspect msgsndp wasn't implemented for TCG because TCG doesn't support
more than one thread per core, and you can only send doorbells to other
threads in the same core, and therefore there is no reason to ever use
msgsndp.
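(A hedged sketch of the send side, simplified from doorbell_core_ipi() in
arch/powerpc/kernel/dbell.c: the message carries only the target's thread
ID within the core, which is the architectural reason msgsndp cannot reach
another core:

/* Simplified sketch; PPC_DBELL_TYPE/PPC_DBELL_SERVER/PPC_MSGSNDP are the
 * real macros from asm/dbell.h and asm/ppc-opcode.h, and
 * get_hard_smp_processor_id() is from asm/smp.h. */
static inline void send_core_doorbell(int cpu)
{
	u32 msg = PPC_DBELL_TYPE(PPC_DBELL_SERVER) |
		  (get_hard_smp_processor_id(cpu) & 0x7f);

	__asm__ __volatile__ (PPC_MSGSNDP(%0) : : "r" (msg));
}
)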

That's the message Suraj mentioned up thread, eg:

  $ qemu-system-ppc64 -nographic -vga none -M pseries -smp 2,threads=2 -cpu 
POWER8 -kernel build~/vmlinux
  qemu-system-ppc64: TCG cannot support more than 1 thread/core on a pseries 
machine


But I guess we've hit another case of a CPU sending itself an IPI, and
the way the sibling masks are done, CPUs are siblings of themselves, so
the sibling test passes, eg:

int doorbell_try_core_ipi(int cpu)
{
        int this_cpu = get_cpu();
        int ret = 0;

        if (cpumask_test_cpu(cpu, cpu_sibling_mask(this_cpu))) {
                doorbell_core_ipi(cpu);
                ret = 1;
        }

        put_cpu();

        return ret;
}

In which case this patch should fix it.

diff --git a/arch/powerpc/kernel/dbell.c b/arch/powerpc/kernel/dbell.c
index f17ff1200eaa..e45cb9bba193 100644
--- a/arch/powerpc/kernel/dbell.c
+++ b/arch/powerpc/kernel/dbell.c
@@ -63,7 +63,7 @@ int doorbell_try_core_ipi(int cpu)
 	int this_cpu = get_cpu();
 	int ret = 0;
 
-	if (cpumask_test_cpu(cpu, cpu_sibling_mask(this_cpu))) {
+	if (cpu != this_cpu && cpumask_test_cpu(cpu, cpu_sibling_mask(this_cpu))) {
 		doorbell_core_ipi(cpu);
 		ret = 1;
 	}

Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-19 Thread Leon Romanovsky
On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> Hi,
>
> This implements an API naming change (put_user_page*() -->
> unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> extends that tracking to a few select subsystems. More subsystems will
> be added in follow up work.

Hi John,

The patchset generates kernel panics in our IB testing. In our tests, we
allocated a single memory block and registered multiple MRs using the
single block.

The possible bad flow is:
 ib_umem_get() ->
  pin_user_pages_fast(FOLL_WRITE) ->
   internal_get_user_pages_fast(FOLL_WRITE) ->
    gup_pgd_range() ->
     gup_huge_pd() ->
      gup_hugepte() ->
       try_grab_compound_head() ->

 108 static __maybe_unused struct page *try_grab_compound_head(struct page *page,
 109                                                            int refs,
 110                                                            unsigned int flags)
 111 {
 112 if (flags & FOLL_GET)
 113 return try_get_compound_head(page, refs);
 114 else if (flags & FOLL_PIN)
 115 return try_pin_compound_head(page, refs);
 116
 117 WARN_ON_ONCE(1);
 118 return NULL;
 119 }

# (master) $ dmesg
[10924.70] mlx5_core :00:08.0 eth2: Link up
[10924.725383] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[10960.902254] [ cut here ]
[10960.905614] WARNING: CPU: 3 PID: 8838 at mm/gup.c:61 
try_grab_compound_head+0x92/0xd0
[10960.907313] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss 
nfsv4 dns_resolver nfs lockd grace fscache ib_isert iscsi_target_mod ib_srpt 
target_core_mod ib_srp rpcrdma rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib iw_cm 
ib_cm mlx5_ib ib_uverbs ib_core kvm_intel mlx5_core rfkill mlxfw sunrpc 
virtio_net pci_hyperv_intf kvm irqbypass net_failover crc32_pclmul i2c_piix4 
ptp crc32c_intel failover pcspkr ghash_clmulni_intel i2c_core pps_core 
sch_fq_codel ip_tables ata_generic pata_acpi serio_raw ata_piix floppy [last 
unloaded: mlxkvl]
[10960.917806] CPU: 3 PID: 8838 Comm: consume_mtts Tainted: G   OE 
5.5.0-rc2-for-upstream-perf-2019-12-18_10-06-50-78 #1
[10960.920530] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1ubuntu1 04/01/2014
[10960.923024] RIP: 0010:try_grab_compound_head+0x92/0xd0
[10960.924329] Code: e4 8d 14 06 48 8d 4f 34 f0 0f b1 57 34 0f 94 c2 84 d2 75 
cb 85 c0 74 cd 8d 14 06 f0 0f b1 11 0f 94 c2 84 d2 75 b9 66 90 eb ea <0f> 0b 31 
ff eb b7 85 c0 66 0f 1f 44 00 00 74 ab 8d 14 06 f0 0f b1
[10960.928512] RSP: 0018:c9000129f880 EFLAGS: 00010082
[10960.929831] RAX: 8001 RBX: 7f6397446000 RCX: 000fffe0
[10960.931422] RDX: 0004 RSI: 00011800 RDI: ea000f5d8000
[10960.933005] RBP: c9000129f93c R08: c9000129f93c R09: 0020
[10960.934584] R10: 88840774b200 R11: 88800230 R12: 7f6397446000
[10960.936212] R13: 0046 R14: 8003d76000e7 R15: 0080
[10960.937793] FS:  7f63a0590740() GS:88842f98() 
knlGS:
[10960.939962] CS:  0010 DS:  ES:  CR0: 80050033
[10960.941367] CR2: 023e9008 CR3: 000406d0a002 CR4: 007606e0
[10960.942975] DR0:  DR1:  DR2: 
[10960.944654] DR3:  DR6: fffe0ff0 DR7: 0400
[10960.946394] PKRU: 5554
[10960.947310] Call Trace:
[10960.948193]  gup_pgd_range+0x61e/0x950
[10960.949585]  internal_get_user_pages_fast+0x98/0x1c0
[10960.951313]  ib_umem_get+0x2b3/0x5a0 [ib_uverbs]
[10960.952929]  mr_umem_get+0xd8/0x280 [mlx5_ib]
[10960.954150]  ? xas_store+0x49/0x550
[10960.955187]  mlx5_ib_reg_user_mr+0x149/0x7a0 [mlx5_ib]
[10960.956478]  ? xas_load+0x9/0x80
[10960.957474]  ? xa_load+0x54/0x90
[10960.958465]  ? lookup_get_idr_uobject.part.10+0x12/0x80 [ib_uverbs]
[10960.959926]  ib_uverbs_reg_mr+0x138/0x2a0 [ib_uverbs]
[10960.961192]  ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xb1/0xf0 
[ib_uverbs]
[10960.963208]  ib_uverbs_cmd_verbs.isra.8+0x997/0xb30 [ib_uverbs]
[10960.964603]  ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
[10960.965949]  ? mem_cgroup_commit_charge+0x6a/0x140
[10960.967177]  ? page_add_new_anon_rmap+0x58/0xc0
[10960.968360]  ib_uverbs_ioctl+0xbc/0x130 [ib_uverbs]
[10960.969595]  do_vfs_ioctl+0xa6/0x640
[10960.970631]  ? syscall_trace_enter+0x1f8/0x2e0
[10960.971829]  ksys_ioctl+0x60/0x90
[10960.972825]  __x64_sys_ioctl+0x16/0x20
[10960.973888]  do_syscall_64+0x48/0x130
[10960.974949]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10960.976219] RIP: 0033:0x7f639fe9b267
[10960.977260] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 
c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 
f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48
[10960.981413] RSP: 002b:7fff5335ca08 EFLAGS: 0246 ORIG_RAX: 
0010
[10960.983472] RAX: 

Re: use generic DMA mapping code in powerpc V4

2019-12-19 Thread Christian Zigotzky

Hi All,

We still have some issues with PCI cards in our FSL P5020 and P5040 
systems since the DMA mapping updates. [1, 2]


We have to limit the RAM to 3500MB for some problematic PCI cards. 
(kernel boot argument 'mem=3500M')


The problematic DMA mapping code was added with the PowerPC updates 
4.21-1 to the official kernel source code last year. [3]


We have created a bug report. [4]

The old 4.x kernels aren't affected because they use the old DMA code.

Please check the new DMA code again.

Thanks,
Christian

[1] 
http://forum.hyperion-entertainment.com/viewtopic.php?f=58&p=49486#p49486
[2] 
http://forum.hyperion-entertainment.com/viewtopic.php?f=58&t=4349&start=50#p49099
[3] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8d6973327ee84c2f40dd9efd8928d4a1186c96e2

[4] https://bugzilla.kernel.org/show_bug.cgi?id=205201


On 28 November 2018 at 4:55 pm, Christian Zigotzky wrote:
> On 28 November 2018 at 12:05PM, Michael Ellerman wrote:
>> Nothing specific yet.
>>
>> I'm a bit worried it might break one of the many old obscure platforms
>> we have that aren't well tested.
>
> Please don't apply the new DMA mapping code if you aren't sure that it
> works on all supported PowerPC machines. Is the new DMA mapping code
> really necessary? It's not really nice to rewrite code when the old code
> works perfectly. We must not forget that we work for the end users. Does
> the end user have advantages with this new code? Is it faster? The old
> code works without any problems. I am also worried about this code. How
> can I test this new DMA mapping code?
>
> Thanks






[PATCH] powerpc/pseries: Remove redundant select of PPC_DOORBELL

2019-12-19 Thread Michael Ellerman
Commit d4e58e5928f8 ("powerpc/powernv: Enable POWER8 doorbell IPIs")
added a select of PPC_DOORBELL to PPC_PSERIES, but it already had a
select of PPC_DOORBELL. One is enough.

Reported-by: Jason A. Donenfeld 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/platforms/pseries/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index 595e9f8a6539..24c18362e5ea 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -21,7 +21,6 @@ config PPC_PSERIES
select PPC_DOORBELL
select HOTPLUG_CPU
select ARCH_RANDOM
-   select PPC_DOORBELL
select FORCE_SMP
select SWIOTLB
default y
-- 
2.21.0



[PATCH] powerpc/85xx: Get twr_p102x to compile again

2019-12-19 Thread Sebastian Andrzej Siewior
With CONFIG_QUICC_ENGINE enabled and CONFIG_UCC_GETH + CONFIG_SERIAL_QE
disabled we have an unused variable (np). The code won't compile with
-Werror.

Move the np variable to the block where it is actually used.

Signed-off-by: Sebastian Andrzej Siewior 
---
 arch/powerpc/platforms/85xx/twr_p102x.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/85xx/twr_p102x.c 
b/arch/powerpc/platforms/85xx/twr_p102x.c
index 6c3c0cdaee9ad..b301ef9d6ce75 100644
--- a/arch/powerpc/platforms/85xx/twr_p102x.c
+++ b/arch/powerpc/platforms/85xx/twr_p102x.c
@@ -60,10 +60,6 @@ static void __init twr_p1025_pic_init(void)
  */
 static void __init twr_p1025_setup_arch(void)
 {
-#ifdef CONFIG_QUICC_ENGINE
-   struct device_node *np;
-#endif
-
if (ppc_md.progress)
ppc_md.progress("twr_p1025_setup_arch()", 0);
 
@@ -77,6 +73,7 @@ static void __init twr_p1025_setup_arch(void)
 #if IS_ENABLED(CONFIG_UCC_GETH) || IS_ENABLED(CONFIG_SERIAL_QE)
if (machine_is(twr_p1025)) {
struct ccsr_guts __iomem *guts;
+   struct device_node *np;
 
np = of_find_compatible_node(NULL, NULL, "fsl,p1021-guts");
if (np) {
-- 
2.24.0



Re: [Qemu-ppc] pseries on qemu-system-ppc64le crashes in doorbell_core_ipi()

2019-12-19 Thread Cédric Le Goater
On 19/12/2019 13:45, Michael Ellerman wrote:
> "Jason A. Donenfeld"  writes:
>> Hi folks,
>>
>> I'm actually still experiencing this sporadically in the WireGuard test 
>> suite, which you can see being run on https://build.wireguard.com/ . 
> 
> Fancy dashboard you got there :)
> 
>> About 50% of the time the powerpc64 build will fail at a place like this:
>>
>> [   65.147823] Oops: Exception in kernel mode, sig: 4 [#1]
>> [   65.149198] LE PAGE_SIZE=4K MMU=Hash PREEMPT SMP NR_CPUS=4 NUMA pSeries
>> [   65.149595] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc1+ #1
>> [   65.149745] NIP:  c000 LR: c007eda0 CTR: 
>> c007ed80
>> [   65.149934] REGS: c0a47970 TRAP: 0700   Not tainted  (5.5.0-rc1+)
>> [   65.150032] MSR:  8288b033  > 
>> CR: 48008288  XER: 
>> [   65.150352] CFAR: c00332bc IRQMASK: 1
>> [   65.150352] GPR00: c0036508 c0a47c00 c0a4c100 
>> 0001
>> [   65.150352] GPR04: c0a50998  c0a50908 
>> 0f509000
>> [   65.150352] GPR08: 2800   
>> cff24f00
>> [   65.150352] GPR12: c007ed80 c0ad9000  
>> 
>> [   65.150352] GPR16: 008c9190 008c94a8 008c92f8 
>> 008c98b0
>> [   65.150352] GPR20: 008f2f88 fffd 0014 
>> 00e6c100
>> [   65.150352] GPR24: 00e6c100 0001  
>> c0a50998
>> [   65.150352] GPR28: c0a9e280 c0a50aa4 0002 
>> 
>> [   65.151591] NIP [c000] doorbell_try_core_ipi+0xd0/0xf0
>> [   65.151750] LR [c007eda0] smp_pseries_cause_ipi+0x20/0x70
>> [   65.151913] Call Trace:
>> [   65.152109] [c0a47c00] [c00c7c9c] 
>> _nohz_idle_balance+0xbc/0x300 (unreliable)
>> [   65.152370] [c0a47c30] [c0036508] 
>> smp_send_reschedule+0x98/0xb0
>> [   65.152711] [c0a47c50] [c00c1634] kick_ilb+0x114/0x140
>> [   65.152962] [c0a47ca0] [c00c86d8] 
>> newidle_balance+0x4e8/0x500
>> [   65.153213] [c0a47d20] [c00c8788] 
>> pick_next_task_fair+0x48/0x3a0
>> [   65.153424] [c0a47d80] [c0466620] __schedule+0xf0/0x430
>> [   65.153612] [c0a47de0] [c0466b04] schedule_idle+0x34/0x70
>> [   65.153786] [c0a47e10] [c00c0bc8] do_idle+0x1a8/0x220
>> [   65.154121] [c0a47e70] [c00c0e94] 
>> cpu_startup_entry+0x34/0x40
>> [   65.154313] [c0a47ea0] [c000ef1c] rest_init+0x10c/0x124
>> [   65.154414] [c0a47ee0] [c054] start_kernel+0x568/0x594
>> [   65.154585] [c0a47f90] [c000a7cc] 
>> start_here_common+0x1c/0x330
>> [   65.154854] Instruction dump:
>> [   65.155191] 38210030 e8010010 7c0803a6 4e800020 3d220004 39295228 
>> 8129 3929
>> [   65.155498] 7d284038 7c0004ac 5508017e 65082800 <7c00411c> e94d0178 
>> 812a 3929
>   ^
> Again the faulting instruction there is "msgsndp r8"
> 
>> [   65.156155] ---[ end trace 6180d12e268ffdaf ]---
>> [   65.185452]
>> [   66.187490] Kernel panic - not syncing: Fatal exception
>>
>> This is with "qemu-system-ppc64 -smp 4 -machine pseries" on QEMU 4.0.0.
>>
>> I'm not totally sure what's going on here. I'm emulating a pseries, and 
>> using that with qemu's pseries model, and I see that selecting the 
>> pseries forces the selection of 'config PPC_DOORBELL' (twice in the same 
>> section, actually).
> 
> Noted.
> 
>> Then inside the kernel there appears to be some runtime CPU check for
>> doorbell support.
> 
> Not really. The kernel looks at the CPU revision (PVR) and decides that
> it has doorbell support.
> 
>> Is this a case in which QEMU is advertising doorbell support that TCG
>> doesn't have? Or is something else happening here?
> 
> It's a gap in the emulation I guess. qemu doesn't emulate msgsndp, but
> it really should because that's a supported instruction since Power8.

There is a patch for msgsndp in my tree you could try:

  https://github.com/legoater/qemu/tree/powernv-5.0

Currently being reviewed. I have to address some remarks from David before
it can be merged.

> I suspect msgsndp wasn't implemented for TCG because TCG doesn't support
> more than one thread per core, and you can only send doorbells to other
> threads in the same core, and therefore there is no reason to ever use
> msgsndp.

There is a need now with KVM emulation under TCG, but, yes, QEMU still lacks
SMT support.

> That's the message Suraj mentioned up thread, eg:
> 
>   $ qemu-system-ppc64 -nographic -vga none -M pseries -smp 2,threads=2 -cpu 
> POWER8 -kernel build~/vmlinux
>   qemu-system-ppc64: TCG cannot support more than 1 thread/core on a pseries 
> machine
> 
> 
> But I guess we've hit another case of a CPU sending itself an 

Re: [Qemu-ppc] pseries on qemu-system-ppc64le crashes in doorbell_core_ipi()

2019-12-19 Thread David Gibson
On Thu, Dec 19, 2019 at 02:08:29PM +0100, Cédric Le Goater wrote:
> On 19/12/2019 13:45, Michael Ellerman wrote:
> > "Jason A. Donenfeld"  writes:
> >> Hi folks,
> >>
> >> I'm actually still experiencing this sporadically in the WireGuard test 
> >> suite, which you can see being run on https://build.wireguard.com/ . 
> > 
> > Fancy dashboard you got there :)
> > 
> >> About 50% of the time the powerpc64 build will fail at a place like this:
> >>
> >> [   65.147823] Oops: Exception in kernel mode, sig: 4 [#1]
> >> [   65.149198] LE PAGE_SIZE=4K MMU=Hash PREEMPT SMP NR_CPUS=4 NUMA pSeries
> >> [   65.149595] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc1+ #1
> >> [   65.149745] NIP:  c000 LR: c007eda0 CTR: 
> >> c007ed80
> >> [   65.149934] REGS: c0a47970 TRAP: 0700   Not tainted  
> >> (5.5.0-rc1+)
> >> [   65.150032] MSR:  8288b033  > 
> >> CR: 48008288  XER: 
> >> [   65.150352] CFAR: c00332bc IRQMASK: 1
> >> [   65.150352] GPR00: c0036508 c0a47c00 c0a4c100 
> >> 0001
> >> [   65.150352] GPR04: c0a50998  c0a50908 
> >> 0f509000
> >> [   65.150352] GPR08: 2800   
> >> cff24f00
> >> [   65.150352] GPR12: c007ed80 c0ad9000  
> >> 
> >> [   65.150352] GPR16: 008c9190 008c94a8 008c92f8 
> >> 008c98b0
> >> [   65.150352] GPR20: 008f2f88 fffd 0014 
> >> 00e6c100
> >> [   65.150352] GPR24: 00e6c100 0001  
> >> c0a50998
> >> [   65.150352] GPR28: c0a9e280 c0a50aa4 0002 
> >> 
> >> [   65.151591] NIP [c000] doorbell_try_core_ipi+0xd0/0xf0
> >> [   65.151750] LR [c007eda0] smp_pseries_cause_ipi+0x20/0x70
> >> [   65.151913] Call Trace:
> >> [   65.152109] [c0a47c00] [c00c7c9c] 
> >> _nohz_idle_balance+0xbc/0x300 (unreliable)
> >> [   65.152370] [c0a47c30] [c0036508] 
> >> smp_send_reschedule+0x98/0xb0
> >> [   65.152711] [c0a47c50] [c00c1634] kick_ilb+0x114/0x140
> >> [   65.152962] [c0a47ca0] [c00c86d8] 
> >> newidle_balance+0x4e8/0x500
> >> [   65.153213] [c0a47d20] [c00c8788] 
> >> pick_next_task_fair+0x48/0x3a0
> >> [   65.153424] [c0a47d80] [c0466620] __schedule+0xf0/0x430
> >> [   65.153612] [c0a47de0] [c0466b04] 
> >> schedule_idle+0x34/0x70
> >> [   65.153786] [c0a47e10] [c00c0bc8] do_idle+0x1a8/0x220
> >> [   65.154121] [c0a47e70] [c00c0e94] 
> >> cpu_startup_entry+0x34/0x40
> >> [   65.154313] [c0a47ea0] [c000ef1c] rest_init+0x10c/0x124
> >> [   65.154414] [c0a47ee0] [c054] 
> >> start_kernel+0x568/0x594
> >> [   65.154585] [c0a47f90] [c000a7cc] 
> >> start_here_common+0x1c/0x330
> >> [   65.154854] Instruction dump:
> >> [   65.155191] 38210030 e8010010 7c0803a6 4e800020 3d220004 39295228 
> >> 8129 3929
> >> [   65.155498] 7d284038 7c0004ac 5508017e 65082800 <7c00411c> e94d0178 
> >> 812a 3929
> >   ^
> > Again the faulting instruction there is "msgsndp r8"
> > 
> >> [   65.156155] ---[ end trace 6180d12e268ffdaf ]---
> >> [   65.185452]
> >> [   66.187490] Kernel panic - not syncing: Fatal exception
> >>
> >> This is with "qemu-system-ppc64 -smp 4 -machine pseries" on QEMU 4.0.0.
> >>
> >> I'm not totally sure what's going on here. I'm emulating a pseries, and 
> >> using that with qemu's pseries model, and I see that selecting the 
> >> pseries forces the selection of 'config PPC_DOORBELL' (twice in the same 
> >> section, actually).
> > 
> > Noted.
> > 
> >> Then inside the kernel there appears to be some runtime CPU check for
> >> doorbell support.
> > 
> > Not really. The kernel looks at the CPU revision (PVR) and decides that
> > it has doorbell support.
> > 
> >> Is this a case in which QEMU is advertising doorbell support that TCG
> >> doesn't have? Or is something else happening here?
> > 
> > It's a gap in the emulation I guess. qemu doesn't emulate msgsndp, but
> > it really should because that's a supported instruction since Power8.
> 
> There is a patch for msgsndp in my tree you could try:
> 
>   https://github.com/legoater/qemu/tree/powernv-5.0
> 
> Currently being reviewed. I have to address some remarks from David before
> it can be merged.

Right.  It needs some polish, but I expect we'll have this merged in
the not too distant future.

> > I suspect msgsndp wasn't implemented for TCG because TCG doesn't support
> > more than one thread per core, and you can only send doorbells to other
> > threads in the same core, and therefore there is no reason to ever use
> > msgsndp.
> 
> There is a need now with KVM 

Re: [Qemu-ppc] pseries on qemu-system-ppc64le crashes in doorbell_core_ipi()

2019-12-19 Thread Jason A. Donenfeld
On Thu, Dec 19, 2019 at 1:52 PM Michael Ellerman  wrote:
> diff --git a/arch/powerpc/kernel/dbell.c b/arch/powerpc/kernel/dbell.c
> index f17ff1200eaa..e45cb9bba193 100644
> --- a/arch/powerpc/kernel/dbell.c
> +++ b/arch/powerpc/kernel/dbell.c
> @@ -63,7 +63,7 @@ int doorbell_try_core_ipi(int cpu)
> int this_cpu = get_cpu();
> int ret = 0;
>
> -   if (cpumask_test_cpu(cpu, cpu_sibling_mask(this_cpu))) {
> +   if (cpu != this_cpu && cpumask_test_cpu(cpu, 
> cpu_sibling_mask(this_cpu))) {
> doorbell_core_ipi(cpu);
> ret = 1;
> }

I realize the best solution is that nice powernv branch that will
eventually be merged for qemu/tcg. But, maybe the above would be a
decent idea to upstream? It seems like that's a case of a superfluous
doorbell?

Jason


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-19 Thread John Hubbard

On 12/19/19 1:07 PM, Jason Gunthorpe wrote:
> On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote:
>> On 12/19/19 5:26 AM, Leon Romanovsky wrote:
>>> On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
>>>> Hi,
>>>>
>>>> This implements an API naming change (put_user_page*() -->
>>>> unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
>>>> extends that tracking to a few select subsystems. More subsystems will
>>>> be added in follow up work.
>>>
>>> Hi John,
>>>
>>> The patchset generates kernel panics in our IB testing. In our tests, we
>>> allocated a single memory block and registered multiple MRs using the
>>> single block.
>>>
>>> The possible bad flow is:
>>>  ib_umem_get() ->
>>>   pin_user_pages_fast(FOLL_WRITE) ->
>>>    internal_get_user_pages_fast(FOLL_WRITE) ->
>>>     gup_pgd_range() ->
>>>      gup_huge_pd() ->
>>>       gup_hugepte() ->
>>>        try_grab_compound_head() ->
>>
>> Hi Leon,
>>
>> Thanks very much for the detailed report! So we're overflowing...
>>
>> At first look, this seems likely to be hitting a weak point in the
>> GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
>> (there's a writeup in Documentation/core-api/pin_user_page.rst, lines
>> 99-121). Basically it's pretty easy to overflow the page->_refcount
>> with huge pages if the pages have a *lot* of subpages.
>>
>> We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
>
> Considering that establishing these pins is entirely under user
> control, we can't have a limit here.

There's already a limit, it's just a much larger one. :) What does "no limit"
really mean, numerically, to you in this case?

> If the number of allowed pins are exhausted then the
> pin_user_pages_fast() must fail back to the user.

I'll poke around the IB call stack and see how much of that return path
is in place, if any. Because it's the same situation for get_user_pages_fast().
This code just added a warning on overflow so we could spot it early.

>> 3. It would be nice if I could reproduce this. I have a two-node mlx5
>> Infiniband test setup, but I have done only the tiniest bit of user space
>> IB coding, so if you have any test programs that aren't too hard to deal
>> with that could possibly hit this, or be tweaked to hit it, I'd be
>> grateful. Keeping in mind that I'm not an advanced IB programmer. At all. :)
>
> Clone this:
>
> https://github.com/linux-rdma/rdma-core.git
>
> Install all the required deps to build it (notably cython), see the README.md
>
> $ ./build.sh
> $ build/bin/run_tests.py
>
> If you get things that far I think Leon can get a reproduction for you

OK, here goes.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH V3 2/2] KVM: PPC: Implement H_SVM_INIT_ABORT hcall

2019-12-19 Thread Sukadev Bhattiprolu
Bharata B Rao [bhar...@linux.ibm.com] wrote:
> On Sat, Dec 14, 2019 at 06:12:08PM -0800, Sukadev Bhattiprolu wrote:
> > +unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm)
> > +{
> > +   int i;
> > +
> > +   if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START))
> > +   return H_UNSUPPORTED;
> > +
> > +   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > +   struct kvm_memory_slot *memslot;
> > +   struct kvm_memslots *slots = __kvm_memslots(kvm, i);
> > +
> > +   if (!slots)
> > +   continue;
> > +
> > +   kvm_for_each_memslot(memslot, slots)
> > +   kvmppc_uvmem_drop_pages(memslot, kvm, false);
> > +   }
> 
> You need to hold srcu_read_lock(&kvm->srcu) here.

Yes, thanks! Fixed in the next version.

Sukadev



Re: [PATCH 04/14] powerpc/vas: Setup IRQ mapping and register port for each window

2019-12-19 Thread Haren Myneni
On Wed, 2019-12-18 at 15:13 -0800, Haren Myneni wrote:
> On Wed, 2019-12-18 at 18:18 +1100, Oliver O'Halloran wrote:
> > On Wed, Nov 27, 2019 at 12:07 PM Haren Myneni  
> > wrote:
> > >
> > > *snip*
> > >
> > > @@ -36,7 +62,18 @@ static int init_vas_instance(struct platform_device 
> > > *pdev)
> > > return -ENODEV;
> > > }
> > >
> > > -   if (pdev->num_resources != 4) {
> > > +   rc = of_property_read_u64(dn, "ibm,vas-port", &port);
> > > +   if (rc) {
> > > +   pr_err("No ibm,vas-port property for %s?\n", pdev->name);
> > > +   /* No interrupts property */
> > > +   nresources = 4;
> > > +   }
> > > +
> > > +   /*
> > > +* interrupts property is available with 'ibm,vas-port' property.
> > > +* 4 Resources and 1 IRQ if interrupts property is available.
> > > +*/
> > > +   if (pdev->num_resources != nresources) {
> > > pr_err("Unexpected DT configuration for [%s, %d]\n",
> > > pdev->name, vasid);
> > > return -ENODEV;
> > 
> > Right, so adding the IRQ in firmware will break the VAS driver in
> > existing kernels since it changes the resource count. This is IMO a
> > bug in the VAS driver that you should fix, but it does mean we need to
> > think twice about having firmware assign an interrupt at boot.
> 
> Correct. Hence the vas-user-space NVRAM switch was added in skiboot.
> 
> > 
> > I had a closer look at this series and I'm not convinced that any
> > firmware changes are actually required either. We already have OPAL
> > calls for allocating an hwirq for the kernel to use and for getting
> > the IRQ's XIVE trigger port (see pnv_ocxl_alloc_xive_irq() for an
> > example). Why not use those here too? Doing so would allow us to
> > assign interrupts to individual windows too which might be useful for
> > the windows used by the kernel.
> 
> Thanks for the pointer. Like using pnv_ocxl_alloc_xive_irq(), we can
> disregard the FW change. BTW, VAS fault handling is needed only for user
> space VAS windows.
> 
>  int vas_alloc_xive_irq(u32 chipid, u32 *irq, u64 *trigger_addr)
> {
> __be64 flags, trigger_page;
> u32 hwirq;
> s64 rc;
> 
> hwirq = opal_xive_allocate_irq_raw(chipid);
> if (hwirq < 0)
> return -ENOENT;
> 
> rc = opal_xive_get_irq_info(hwirq, &flags, NULL, &trigger_page,
>                             NULL, NULL);
> if (rc || !trigger_page) {
> xive_native_free_irq(hwirq);
> return -ENOENT;
> }
> 
> *irq = hwirq;
> *trigger_addr = be64_to_cpu(trigger_page);
> return 0;
> }
> 
> We can have a common function for VAS and cxl, except that per-chip IRQ
> allocation is needed for each VAS instance. I will post a patch-set with
> this change.
> 

POWER9 will have only the XIVE interrupt controller, including on OpenPOWER
systems. Correct?

VAS needs per-chip IRQ allocation. The current interfaces (e.g.
xive_native_alloc_irq(void)) allocate an IRQ on any chip
(OPAL_XIVE_ANY_CHIP).
So to use these interfaces for VAS, are there any concerns with the
following patch?
Changes: pass chip_id to xive_native_alloc_irq() and define
xive_native_alloc_get_irq_info() in xive/native.c, which can be used in
both ocxl and VAS.

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index 24cdf97..b310062 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -108,7 +108,7 @@ struct xive_q {
 extern int xive_native_populate_irq_data(u32 hw_irq,
 struct xive_irq_data *data);
 extern void xive_cleanup_irq_data(struct xive_irq_data *xd);
-extern u32 xive_native_alloc_irq(void);
+extern u32 xive_native_alloc_irq(u32 chip_id);
 extern void xive_native_free_irq(u32 irq);
 extern int xive_native_configure_irq(u32 hw_irq, u32 target, u8 prio, u32 
sw_irq);
 
@@ -137,7 +137,8 @@ extern int xive_native_set_queue_state(u32 vp_id, uint32_t 
prio, u32 qtoggle,
   u32 qindex);
 extern int xive_native_get_vp_state(u32 vp_id, u64 *out_state);
 extern bool xive_native_has_queue_state_support(void);
-
+extern int xive_native_alloc_get_irq_info(u32 chip_id, u32 *irq,
+   u64 *trigger_addr);
 #else
 
 static inline bool xive_enabled(void) { return false; }
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 66858b7..59009e1 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -1299,7 +1299,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
vcpu->arch.xive_cam_word = cpu_to_be32(xc->vp_cam | TM_QW1W2_VO);
 
/* Allocate IPI */
-   xc->vp_ipi = xive_native_alloc_irq();
+   xc->vp_ipi = xive_native_alloc_irq(OPAL_XIVE_ANY_CHIP);
if (!xc->vp_ipi) {
pr_err("Failed to allocate xive irq for VCPU IPI\n");
r 

Re: [PATCH] powerpc/mpc85xx: also write addr_h to spin table for 64bit boot entry

2019-12-19 Thread Scott Wood
On Mon, 2019-11-25 at 23:15 +0800, yingjie_...@126.com wrote:
> From: Bai Yingjie 
> 
> CPU like P4080 has 36bit physical address, its DDR physical
> start address can be configured above 4G by LAW registers.
> 
> For such systems in which their physical memory start address was
> configured higher than 4G, we need also to write addr_h into the spin
> table of the target secondary CPU, so that addr_h and addr_l together
> represent a 64bit physical address.
> Otherwise the secondary core can not get correct entry to start from.
> 
> This should do no harm for normal case where addr_h is all 0.
> 
> Signed-off-by: Bai Yingjie 
> ---
>  arch/powerpc/platforms/85xx/smp.c | 8 
>  1 file changed, 8 insertions(+)
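(For context, a sketch of the ePAPR spin-table layout the patch is about,
as used by arch/powerpc/platforms/85xx/smp.c; with DDR configured above 4G
the high half must be written too:

struct epapr_spin_table {
	u32 addr_h;	/* high 32 bits of the 64-bit boot entry address */
	u32 addr_l;	/* low 32 bits; previously the only half written */
	u32 r3_h;
	u32 r3_l;
	u32 reserved;
	u32 pir;
};

/* What the secondary core effectively computes before jumping: */
/* u64 entry = ((u64)spin_table->addr_h << 32) | spin_table->addr_l; */
)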

Acked-by: Scott Wood 

-Scott




Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-19 Thread Jason Gunthorpe
On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote:
> On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > Hi,
> > > 
> > > This implements an API naming change (put_user_page*() -->
> > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> > > extends that tracking to a few select subsystems. More subsystems will
> > > be added in follow up work.
> > 
> > Hi John,
> > 
> > The patchset generates kernel panics in our IB testing. In our tests, we
> > allocated a single memory block and registered multiple MRs using the
> > single block.
> > 
> > The possible bad flow is:
> >  ib_umem_get() ->
> >   pin_user_pages_fast(FOLL_WRITE) ->
> >    internal_get_user_pages_fast(FOLL_WRITE) ->
> >     gup_pgd_range() ->
> >      gup_huge_pd() ->
> >       gup_hugepte() ->
> >        try_grab_compound_head() ->
> 
> Hi Leon,
> 
> Thanks very much for the detailed report! So we're overflowing...
> 
> At first look, this seems likely to be hitting a weak point in the
> GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
> (there's a writeup in Documentation/core-api/pin_user_page.rst, lines
> 99-121). Basically it's pretty easy to overflow the page->_refcount
> with huge pages if the pages have a *lot* of subpages.
> 
> We can only do about 7 pins on 1GB huge pages that use 4KB subpages.

Considering that establishing these pins is entirely under user
control, we can't have a limit here.

If the number of allowed pins are exhausted then the
pin_user_pages_fast() must fail back to the user.

> 3. It would be nice if I could reproduce this. I have a two-node mlx5 
> Infiniband
> test setup, but I have done only the tiniest bit of user space IB coding, so
> if you have any test programs that aren't too hard to deal with that could
> possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind
> that I'm not an advanced IB programmer. At all. :)

Clone this:

https://github.com/linux-rdma/rdma-core.git

Install all the required deps to build it (notably cython), see the README.md

$ ./build.sh
$ build/bin/run_tests.py 

If you get things that far I think Leon can get a reproduction for you

Jason


[PATCH v4 2/2] KVM: PPC: Implement H_SVM_INIT_ABORT hcall

2019-12-19 Thread Sukadev Bhattiprolu
Implement the H_SVM_INIT_ABORT hcall which the Ultravisor can use to
abort an SVM after it has issued the H_SVM_INIT_START and before the
H_SVM_INIT_DONE hcalls. This hcall could be used when Ultravisor
encounters security violations or other errors when starting an SVM.

Note that this hcall is different from the UV_SVM_TERMINATE ucall, which
is used by the HV to terminate/clean up a VM that has become secure.

The H_SVM_INIT_ABORT should basically undo operations that were done
since the H_SVM_INIT_START hcall - i.e. page out all the VM pages back
to normal memory, and terminate the SVM.

(If we do not bring the pages back to normal memory, the text/data
of the VM would be stuck in secure memory and since the SVM did not
go secure, its MSR_S bit will be clear and the VM won't be able to
access its pages even to do a clean exit.)

Based on patches and discussion with Paul Mackerras, Ram Pai and
Bharata Rao.

Signed-off-by: Ram Pai 
Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Bharata B Rao 
---
Changelog[v4]:
- [Bharata Rao] Add missing rcu locking
- [Paul Mackerras] simplify code that walks memslots
- Add a check to ensure that H_SVM_INIT_ABORT is called before
  H_SVM_INIT_DONE hcall (i.e the SVM is not already secure).

Changelog[v3]:
- Rather than pass the NIP/MSR as parameters, load them into
  SRR0/SRR1 (like we do with other registers) and terminate
  the VM after paging out pages
- Move the code to add a skip_page_out parameter into a
  separate patch.

Changelog[v2]:
[Paul Mackerras] avoid returning to UV "one last time" after
the state is cleaned up.  So, we now have H_SVM_INIT_ABORT:
- take the VM's NIP/MSR register states as parameters
- inherit the state of other registers as at UV_ESM call.
After cleaning up the partial state, HV uses these to return
directly to the VM with a failed UV_ESM call.
---
 Documentation/powerpc/ultravisor.rst| 57 +
 arch/powerpc/include/asm/hvcall.h   |  1 +
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  6 +++
 arch/powerpc/include/asm/kvm_host.h |  1 +
 arch/powerpc/kvm/book3s_hv.c|  3 ++
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 26 ++
 6 files changed, 94 insertions(+)

diff --git a/Documentation/powerpc/ultravisor.rst 
b/Documentation/powerpc/ultravisor.rst
index 730854f73830..8c114c071bfa 100644
--- a/Documentation/powerpc/ultravisor.rst
+++ b/Documentation/powerpc/ultravisor.rst
@@ -948,6 +948,63 @@ Use cases
 up its internal state for this virtual machine.
 
 
+H_SVM_INIT_ABORT
+----------------
+
+Abort the process of securing an SVM.
+
+Syntax
+~~
+
+.. code-block:: c
+
+   uint64_t hypercall(const uint64_t H_SVM_INIT_ABORT)
+
+Return values
+~~~~~~~~~~~~~
+
+One of the following values:
+
+   * H_PARAMETER   on successfully cleaning up the state,
+   Hypervisor will return this value to the
+   **guest**, to indicate that the underlying
+   UV_ESM ultracall failed.
+
+   * H_UNSUPPORTED if called from the wrong context (e.g. from
+   an SVM or before an H_SVM_INIT_START hypercall).
+
+Description
+~~~~~~~~~~~
+
+Abort the process of securing a virtual machine. This call must
+be made after a prior call to ``H_SVM_INIT_START`` hypercall and
+before a call to ``H_SVM_INIT_DONE``.
+
+On entry into this hypercall the non-volatile GPRs and FPRs are
+expected to contain the values they had at the time the VM issued
+the UV_ESM ultracall. Further ``SRR0`` is expected to contain the
+address of the instruction after the ``UV_ESM`` ultracall and ``SRR1``
+the MSR value with which to return to the VM.
+
+This hypercall will cleanup any partial state that was established for
+the VM since the prior ``H_SVM_INIT_START`` hypercall, including paging
+out pages that were paged-into secure memory, and issue the
+``UV_SVM_TERMINATE`` ultracall to terminate the VM.
+
+After the partial state is cleaned up, control returns to the VM
+(**not Ultravisor**), at the address specified in ``SRR0`` with the
+MSR values set to the value in ``SRR1``.
+
+Use cases
+~~~~~~~~~
+
+If after a successful call to ``H_SVM_INIT_START``, the Ultravisor
+encounters an error while securing a virtual machine, either due
+to lack of resources or because the VM's security information could
+not be validated, Ultravisor informs the Hypervisor about it.
+Hypervisor should use this call to clean up any internal state for
+this virtual machine and return to the VM.
+
 H_SVM_PAGE_IN
 -------------
 
diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 13bd870609c3..e90c073e437e 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h

[PATCH v4 1/2] KVM: PPC: Add skip_page_out parameter

2019-12-19 Thread Sukadev Bhattiprolu
Add a 'skip_page_out' parameter to kvmppc_uvmem_drop_pages() so the
callers can specify whether or not to skip paging out pages. This
will be needed in a follow-on patch that implements the H_SVM_INIT_ABORT
hcall.

Reviewed-by: Paul Mackerras 
Signed-off-by: Sukadev Bhattiprolu 
---
 arch/powerpc/include/asm/kvm_book3s_uvmem.h | 4 ++--
 arch/powerpc/kvm/book3s_64_mmu_radix.c  | 2 +-
 arch/powerpc/kvm/book3s_hv.c| 2 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 4 ++--
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index 50204e228f16..3cf8425b9838 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -20,7 +20,7 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm);
 unsigned long kvmppc_h_svm_init_done(struct kvm *kvm);
 int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gfn);
 void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
-struct kvm *kvm);
+struct kvm *kvm, bool skip_page_out);
 #else
 static inline int kvmppc_uvmem_init(void)
 {
@@ -69,6 +69,6 @@ static inline int kvmppc_send_page_to_uv(struct kvm *kvm, 
unsigned long gfn)
 
 static inline void
 kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
-   struct kvm *kvm) { }
+   struct kvm *kvm, bool skip_page_out) { }
 #endif /* CONFIG_PPC_UV */
 #endif /* __ASM_KVM_BOOK3S_UVMEM_H__ */
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index da857c8ba6e4..744dba98e5d1 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -1102,7 +1102,7 @@ void kvmppc_radix_flush_memslot(struct kvm *kvm,
unsigned int shift;
 
if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START)
-   kvmppc_uvmem_drop_pages(memslot, kvm);
+   kvmppc_uvmem_drop_pages(memslot, kvm, true);
 
if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
return;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 597f4bfecf0e..66d5312be16b 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5493,7 +5493,7 @@ static int kvmhv_svm_off(struct kvm *kvm)
continue;
 
kvm_for_each_memslot(memslot, slots) {
-   kvmppc_uvmem_drop_pages(memslot, kvm);
+   kvmppc_uvmem_drop_pages(memslot, kvm, true);
uv_unregister_mem_slot(kvm->arch.lpid, memslot->id);
}
}
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index f24ac3cfb34c..9a5bbad7d87e 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -259,7 +259,7 @@ unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
  * QEMU page table with normal PTEs from newly allocated pages.
  */
 void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
-struct kvm *kvm)
+struct kvm *kvm, bool skip_page_out)
 {
int i;
struct kvmppc_uvmem_page_pvt *pvt;
@@ -277,7 +277,7 @@ void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot 
*free,
 
uvmem_page = pfn_to_page(uvmem_pfn);
pvt = uvmem_page->zone_device_data;
-   pvt->skip_page_out = true;
+   pvt->skip_page_out = skip_page_out;
	mutex_unlock(&kvm->arch.uvmem_lock);
 
pfn = gfn_to_pfn(kvm, gfn);
-- 
2.17.2



Re: [PATCH V3 2/2] KVM: PPC: Implement H_SVM_INIT_ABORT hcall

2019-12-19 Thread Sukadev Bhattiprolu
Paul Mackerras [pau...@ozlabs.org] wrote:
> On Sat, Dec 14, 2019 at 06:12:08PM -0800, Sukadev Bhattiprolu wrote:
> > 
> > Implement the H_SVM_INIT_ABORT hcall which the Ultravisor can use to
> > abort an SVM after it has issued the H_SVM_INIT_START and before the
> > H_SVM_INIT_DONE hcalls. This hcall could be used when Ultravisor
> > encounters security violations or other errors when starting an SVM.
> > 
> > Note that this hcall is different from the UV_SVM_TERMINATE ucall, which
> > is used by the HV to terminate/clean up a VM that has become secure.
> > 
> > The H_SVM_INIT_ABORT should basically undo operations that were done
> > since the H_SVM_INIT_START hcall - i.e. page out all the VM pages back
> > to normal memory, and terminate the SVM.
> > 
> > (If we do not bring the pages back to normal memory, the text/data
> > of the VM would be stuck in secure memory and since the SVM did not
> > go secure, its MSR_S bit will be clear and the VM won't be able to
> > access its pages even to do a clean exit.)
> > 
> > Based on patches and discussion with Paul Mackerras, Ram Pai and
> > Bharata Rao.
> > 
> > Signed-off-by: Ram Pai 
> > Signed-off-by: Sukadev Bhattiprolu 
> > Signed-off-by: Bharata B Rao 
> 
> Minor comment below, but not a showstopper.  Also, as Bharata noted
> you need to hold the srcu lock for reading.

Yes, I fixed that.

> 
> > +   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > +   struct kvm_memory_slot *memslot;
> > +   struct kvm_memslots *slots = __kvm_memslots(kvm, i);
> > +
> > +   if (!slots)
> > +   continue;
> > +
> > +   kvm_for_each_memslot(memslot, slots)
> > +   kvmppc_uvmem_drop_pages(memslot, kvm, false);
> > +   }
> 
> Since we use the default KVM_ADDRESS_SPACE_NUM, which is 1, this code
> isn't wrong but it is more verbose than it needs to be.  It could be
> 
>   kvm_for_each_memslot(kvm_memslots(kvm), slots)
>   kvmppc_uvmem_drop_pages(memslot, kvm, false);

and simplified this.
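(For the archives, a minimal sketch of what the two fixes combine to; the
exact v4 hunk may differ:

	struct kvm_memory_slot *memslot;
	int srcu_idx = srcu_read_lock(&kvm->srcu);	/* Bharata's point */

	kvm_for_each_memslot(memslot, kvm_memslots(kvm))  /* Paul's simplification */
		kvmppc_uvmem_drop_pages(memslot, kvm, false);

	srcu_read_unlock(&kvm->srcu, srcu_idx);
)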

Thanks.

Sukadev


[PATCH] serial: ucc_uart: remove redundant assignment to pointer bdp

2019-12-19 Thread Colin King
From: Colin Ian King 

The variable bdp is being initialized with a value that is never
read and it is being updated later with a new value. The initialization
is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King 
---
 drivers/tty/serial/ucc_uart.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/tty/serial/ucc_uart.c b/drivers/tty/serial/ucc_uart.c
index afc2a5d69202..99a069ed3636 100644
--- a/drivers/tty/serial/ucc_uart.c
+++ b/drivers/tty/serial/ucc_uart.c
@@ -336,8 +336,6 @@ static int qe_uart_tx_pump(struct uart_qe_port *qe_port)
 	struct uart_port *port = &qe_port->port;
 	struct circ_buf *xmit = &port->state->xmit;
 
-   bdp = qe_port->rx_cur;
-
/* Handle xon/xoff */
if (port->x_char) {
/* Pick next descriptor and fill from buffer */
-- 
2.24.0



Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-19 Thread John Hubbard

On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
>> Hi,
>>
>> This implements an API naming change (put_user_page*() -->
>> unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
>> extends that tracking to a few select subsystems. More subsystems will
>> be added in follow up work.
>
> Hi John,
>
> The patchset generates kernel panics in our IB testing. In our tests, we
> allocated a single memory block and registered multiple MRs using the
> single block.
>
> The possible bad flow is:
>  ib_umem_get() ->
>   pin_user_pages_fast(FOLL_WRITE) ->
>    internal_get_user_pages_fast(FOLL_WRITE) ->
>     gup_pgd_range() ->
>      gup_huge_pd() ->
>       gup_hugepte() ->
>        try_grab_compound_head() ->


Hi Leon,

Thanks very much for the detailed report! So we're overflowing...

At first look, this seems likely to be hitting a weak point in the
GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
(there's a writeup in Documentation/core-api/pin_user_page.rst, lines
99-121). Basically it's pretty easy to overflow the page->_refcount
with huge pages if the pages have a *lot* of subpages.

We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
Do you have any idea how many pins (repeated pins on the same page, which
it sounds like you have) might be involved in your test case,
and the huge page and system page sizes? That would allow calculating
if we're likely overflowing for that reason.
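(A quick back-of-envelope check of that "about 7" figure, as a stand-alone
C snippet; it assumes a 32-bit page->_refcount and
GUP_PIN_COUNTING_BIAS == 1024:

#include <stdio.h>

int main(void)
{
	long subpages = (1L << 30) / (1L << 12); /* 1GB huge page, 4KB subpages */
	long bias     = 1024;                    /* GUP_PIN_COUNTING_BIAS */
	long per_pin  = subpages * bias;         /* refs added per full-range pin */
	long max_pins = (1L << 31) / per_pin;    /* headroom in a 32-bit refcount */

	/* prints "268435456 8": the 8th pin already sits at the overflow edge */
	printf("%ld %ld\n", per_pin, max_pins);
	return 0;
}
)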

So, ideas and next steps:

1. Assuming that you *are* hitting this, I think I may have to fall back to
implementing the "deferred" part of this design, as part of this series, after
all. That means:

  For the pin/unpin calls at least, stop treating all pages as if they are
  a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one page.
  That's not how it works now, and the need to hand back a huge array of
  subpages is part of the problem. This affects the callers too, so it's not
  a super quick change to make. (I was really hoping not to have to do this
  yet.)

2. OR, maybe if you're *close* to the overflow, I could buy some development
time by moving the boundary between pinned vs get_page() refcounts, by
reducing GUP_PIN_COUNTING_BIAS. That's less robust, but I don't want
to rule it out just yet. After all, 1024 is a big chunk to count up with,
and even if get_page() calls add up to, say, 512 refs on a page, it's still
just a false positive on page_dma_pinned(). And false positives, if transient,
are OK.

3. It would be nice if I could reproduce this. I have a two-node mlx5 Infiniband
test setup, but I have done only the tiniest bit of user space IB coding, so
if you have any test programs that aren't too hard to deal with that could
possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind
that I'm not an advanced IB programmer. At all. :)

4. (minor note to self) This also uncovers a minor weakness in diagnostics:
there's no page dump in these reports, because I chickened out and didn't
include my WARN_ONCE_PAGE() macro that would have provided it. Although,
even without it, it's obvious that this is a page overflow.


thanks,
--
John Hubbard
NVIDIA




>  108 static __maybe_unused struct page *try_grab_compound_head(struct page *page,
>  109                                                            int refs,
>  110                                                            unsigned int flags)
>  111 {
>  112         if (flags & FOLL_GET)
>  113                 return try_get_compound_head(page, refs);
>  114         else if (flags & FOLL_PIN)
>  115                 return try_pin_compound_head(page, refs);
>  116
>  117         WARN_ON_ONCE(1);
>  118         return NULL;
>  119 }
>
> # (master) $ dmesg
> [10924.70] mlx5_core :00:08.0 eth2: Link up
> [10924.725383] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
> [10960.902254] [ cut here ]
> [10960.905614] WARNING: CPU: 3 PID: 8838 at mm/gup.c:61 try_grab_compound_head+0x92/0xd0
> [10960.907313] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss
> nfsv4 dns_resolver nfs lockd grace fscache ib_isert iscsi_target_mod ib_srpt
> target_core_mod ib_srp rpcrdma rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib iw_cm
> ib_cm mlx5_ib ib_uverbs ib_core kvm_intel mlx5_core rfkill mlxfw sunrpc
> virtio_net pci_hyperv_intf kvm irqbypass net_failover crc32_pclmul i2c_piix4
> ptp crc32c_intel failover pcspkr ghash_clmulni_intel i2c_core pps_core
> sch_fq_codel ip_tables ata_generic pata_acpi serio_raw ata_piix floppy [last
> unloaded: mlxkvl]
> [10960.917806] CPU: 3 PID: 8838 Comm: consume_mtts Tainted: G   OE
> 5.5.0-rc2-for-upstream-perf-2019-12-18_10-06-50-78 #1
> [10960.920530] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> 1.10.2-1ubuntu1 04/01/2014
> [10960.923024] RIP: 0010:try_grab_compound_head+0x92/0xd0
> [10960.924329] Code: e4 8d 14 06 48 8d 4f 34 f0 0f b1 57 34 0f 94 c2 84 d2 75 cb 85
> c0 74 cd 8d 14 06 f0 0f b1 11 0f 94 c2 84 

Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-19 Thread John Hubbard

On 12/19/19 1:07 PM, Jason Gunthorpe wrote:
...
>> 3. It would be nice if I could reproduce this. I have a two-node mlx5
>> Infiniband test setup, but I have done only the tiniest bit of user space
>> IB coding, so if you have any test programs that aren't too hard to deal
>> with that could possibly hit this, or be tweaked to hit it, I'd be
>> grateful. Keeping in mind that I'm not an advanced IB programmer. At all. :)
>
> Clone this:
>
> https://github.com/linux-rdma/rdma-core.git
>
> Install all the required deps to build it (notably cython), see the README.md
>
> $ ./build.sh
> $ build/bin/run_tests.py
>
> If you get things that far I think Leon can get a reproduction for you



Cool, it's up and running (1 failure, 3 skipped, out of 67 tests).

This is a great test suite to have running, I'll add it to my scripts. Here's 
the
full output in case the failure or skip cases are a problem:

$ sudo ./build/bin/run_tests.py --verbose

test_create_ah (tests.test_addr.AHTest) ... ok
test_create_ah_roce (tests.test_addr.AHTest) ... skipped "Can't run RoCE tests on IB 
link layer"
test_destroy_ah (tests.test_addr.AHTest) ... ok
test_create_comp_channel (tests.test_cq.CCTest) ... ok
test_destroy_comp_channel (tests.test_cq.CCTest) ... ok
test_create_cq_ex (tests.test_cq.CQEXTest) ... ok
test_create_cq_ex_bad_flow (tests.test_cq.CQEXTest) ... ok
test_destroy_cq_ex (tests.test_cq.CQEXTest) ... ok
test_create_cq (tests.test_cq.CQTest) ... ok
test_create_cq_bad_flow (tests.test_cq.CQTest) ... ok
test_destroy_cq (tests.test_cq.CQTest) ... ok
test_rc_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok
test_ud_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok
test_xrc_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok
test_create_dm (tests.test_device.DMTest) ... ok
test_create_dm_bad_flow (tests.test_device.DMTest) ... ok
test_destroy_dm (tests.test_device.DMTest) ... ok
test_destroy_dm_bad_flow (tests.test_device.DMTest) ... ok
test_dm_read (tests.test_device.DMTest) ... ok
test_dm_write (tests.test_device.DMTest) ... ok
test_dm_write_bad_flow (tests.test_device.DMTest) ... ok
test_dev_list (tests.test_device.DeviceTest) ... ok
test_open_dev (tests.test_device.DeviceTest) ... ok
test_query_device (tests.test_device.DeviceTest) ... ok
test_query_device_ex (tests.test_device.DeviceTest) ... ok
test_query_gid (tests.test_device.DeviceTest) ... ok
test_query_port (tests.test_device.DeviceTest) ... FAIL
test_query_port_bad_flow (tests.test_device.DeviceTest) ... ok
test_create_dm_mr (tests.test_mr.DMMRTest) ... ok
test_destroy_dm_mr (tests.test_mr.DMMRTest) ... ok
test_buffer (tests.test_mr.MRTest) ... ok
test_dereg_mr (tests.test_mr.MRTest) ... ok
test_dereg_mr_twice (tests.test_mr.MRTest) ... ok
test_lkey (tests.test_mr.MRTest) ... ok
test_read (tests.test_mr.MRTest) ... ok
test_reg_mr (tests.test_mr.MRTest) ... ok
test_reg_mr_bad_flags (tests.test_mr.MRTest) ... ok
test_reg_mr_bad_flow (tests.test_mr.MRTest) ... ok
test_rkey (tests.test_mr.MRTest) ... ok
test_write (tests.test_mr.MRTest) ... ok
test_dereg_mw_type1 (tests.test_mr.MWTest) ... ok
test_dereg_mw_type2 (tests.test_mr.MWTest) ... ok
test_reg_mw_type1 (tests.test_mr.MWTest) ... ok
test_reg_mw_type2 (tests.test_mr.MWTest) ... ok
test_reg_mw_wrong_type (tests.test_mr.MWTest) ... ok
test_odp_rc_traffic (tests.test_odp.OdpTestCase) ... ok
test_odp_ud_traffic (tests.test_odp.OdpTestCase) ... skipped 'ODP is not 
supported - ODP recv not supported'
test_odp_xrc_traffic (tests.test_odp.OdpTestCase) ... ok
test_default_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok
test_mem_align_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok
test_without_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok
test_alloc_pd (tests.test_pd.PDTest) ... ok
test_create_pd_none_ctx (tests.test_pd.PDTest) ... ok
test_dealloc_pd (tests.test_pd.PDTest) ... ok
test_destroy_pd_twice (tests.test_pd.PDTest) ... ok
test_multiple_pd_creation (tests.test_pd.PDTest) ... ok
test_create_qp_ex_no_attr (tests.test_qp.QPTest) ... ok
test_create_qp_ex_no_attr_connected (tests.test_qp.QPTest) ... ok
test_create_qp_ex_with_attr (tests.test_qp.QPTest) ... ok
test_create_qp_ex_with_attr_connected (tests.test_qp.QPTest) ... ok
test_create_qp_no_attr (tests.test_qp.QPTest) ... ok
test_create_qp_no_attr_connected (tests.test_qp.QPTest) ... ok
test_create_qp_with_attr (tests.test_qp.QPTest) ... ok
test_create_qp_with_attr_connected (tests.test_qp.QPTest) ... ok
test_modify_qp (tests.test_qp.QPTest) ... ok
test_query_qp (tests.test_qp.QPTest) ... ok
test_rdmacm_sync_traffic (tests.test_rdmacm.CMTestCase) ... skipped 'No devices 
with net interface'

==
FAIL: test_query_port (tests.test_device.DeviceTest)
--
Traceback (most recent call last):
  File "/kernel_work/rdma-core/tests/test_device.py", line 129, in 
test_query_port

Re: [PATCH] powerpc/85xx: Get twr_p102x to compile again

2019-12-19 Thread Scott Wood
On Thu, 2019-12-19 at 16:16 +0100, Sebastian Andrzej Siewior wrote:
> With CONFIG_QUICC_ENGINE enabled and CONFIG_UCC_GETH + CONFIG_SERIAL_QE
> disabled we have an unused variable (np). The code won't compile with
> -Werror.
> 
> Move the np variable to the block where it is actually used.
> 
> Signed-off-by: Sebastian Andrzej Siewior 
> ---
>  arch/powerpc/platforms/85xx/twr_p102x.c | 5 +
>  1 file changed, 1 insertion(+), 4 deletions(-)

Acked-by: Scott Wood 

-Scott




Re: [PATCH 1/1] kvm/book3s_64: Fixes crash caused by not cleaning vhost IOTLB

2019-12-19 Thread Alexey Kardashevskiy



On 19/12/2019 10:28, Leonardo Bras wrote:
> On Wed, 2019-12-18 at 15:53 +1100, Alexey Kardashevskiy wrote:
>> H_STUFF_TCE is always called with 0. Well, may be some AIX somewhere
>> calls it with a value other than zero, and I probably saw some other
>> value somewhere but in QEMU/KVM case it is 0 so you effectively disable
>> in-kernel acceleration of H_STUFF_TCE which is
>> undesirable.
>>
> 
> Thanks for the feedback!
> 
>> For now we should disable in-kernel H_STUFF_TCE/... handlers in QEMU
>> just like we do for VFIO for older host kernels:
>>
>> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_iommu.c;h=3d3bcc86496a5277d62f7855fbb09c013c015f27;hb=HEAD#l208
>  
> I am still reading into this temporary solution; I could still not
> understand how it works.
> 
>> I am not sure what a proper solution would be, something like an eventfd
>> and KVM's kvmppc_h_stuff_tce() signaling vhost that the latter needs to
>> invalidate iotlbs. Or we can just say that we do not allow KVM
>> acceleration if there is vhost+iommu on the same liobn (== vPHB, pretty
>> much). Thanks,
> 
> I am not used to eventfd, but I agree it's a valid solution to talk to
> QEMU and then use it to send a message via /dev/vhost.
> KVM -> QEMU -> vhost
> 
> But I can't get my mind out of another solution: doing it in
> kernelspace.  I am not sure how that would work, though.
> 
> If I could understand correctly, there is a vhost IOTLB per vhost_dev,
> and H_STUFF_TCE is not called in 64-bit DMA case (for tce_value == 0
> case, at least), which makes sense, given it doesn't need to invalidate
> entries on IOTLB.
> 
> So, we would need to somehow replace `return H_TOO_HARD` in this patch
> with code that could call vhost_process_iotlb_msg() with
> VHOST_IOTLB_INVALIDATE.
> 
> For that, I would need to know what the vhost_dev's of that process
> are, which I don't know if it's possible to do currently (or safe
> at all).
> 
> I am thinking of linking all vhost_dev's with a list (list.h) that
> could be searched, comparing the `mm_struct *` of the calling task with
> each vhost_dev's, and removing the entry from every IOTLB that hits.
> 
> Not sure if that's the best approach to find the related vhost_dev's.
> 
> What do you think?


As discussed in slack, we need to do the same thing we do with physical
devices when we invalidate hardware IOMMU translation caches via
tbl->it_ops->tce_kill. The problem to solve now is how we tell KVM/PPC
about vhost/iotlb (is there an fd?), something similar to the existing
KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE. I guess x86 handles all the mappings
in QEMU and therefore they do not have this problem. Thanks,


-- 
Alexey
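
For concreteness, Leonardo's list idea might look roughly like the
following; every name here (the list, the lock, the helper) is
hypothetical, and the lifetime/locking questions raised above are left
open:

/* hypothetical sketch only: no such global list exists in vhost today */
struct vhost_dev *dev;

mutex_lock(&vhost_dev_list_lock);
list_for_each_entry(dev, &vhost_dev_list, list_node) {
	/* match the devices owned by the invalidating process */
	if (dev->mm == current->mm)
		vhost_dev_invalidate_iotlb(dev, ua, npages << PAGE_SHIFT);
}
mutex_unlock(&vhost_dev_list_lock);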


Re: [PATCH 08/10] crypto/NX: Add NX GZIP user space API

2019-12-19 Thread Herbert Xu
On Thu, Dec 19, 2019 at 12:49:44AM -0800, Haren Myneni wrote:
> 
> Virtual Accelerator Switchboard (VAS) can provide support for different
> accelerators. Right now only NX is used, but it is possible to extend it
> to others in the future, or to different functionalities such as fast
> thread wakeup (a VAS feature) with VAS windows.
> 
> So we are looking at a common VAS API for any of its accelerators: open
> a window / channel (open() and ioctl() calls), set up the communication
> by mapping the NX address (mmap()), and close the window. Then user
> space communicates with the accelerator directly, without kernel
> involvement. Specific drivers should set window attributes such as how
> many requests can be sent at the same time, etc. All other interfaces
> should be the same for any accelerator.
> 
> Also, since user space sends requests directly, we should restrict
> malicious users from overloading NX (a security issue), allowing the
> sysadmin to restrict /dev/crypto/nx-gzip usage.

If you are going to place your driver through the Crypto API then
it needs to use the Crypto API interface for user-space access.
That interface is af_alg.

If this is not a good fit then I suggest that you move your API
elsewhere, perhaps to the powerpc tree where the user-space API can
then be properly reviewed.

It is not feasible to review your driver's user-space API through
the crypto tree.

> As you suggested, the SW crypto API (af_alg) could be used just for NX
> compression, i.e. an API based on the accelerator's functionality. It
> is a socket-based API using the AF_ALG socket family. But is there a way
> for the sysadmin to restrict usage from user space? We would need just a
> few functions in struct proto.

The af_alg interface does not operate in the manner that you
describe.  It is an interface that maps onto the underlying kernel
Crypto API operations.  We currently don't have an af_alg module
for compression, but if we did we would be closely following the
current kernel compression interface.

One key feature of af_alg is that it normally is agnostic to the
underlying implementation.  That is, even when the hardware is
absent it would seamlessly switch over to a software implementation.

I say normally because there can be exceptions, e.g., with paes
and hardware keys.
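
To make the shape of that interface concrete, here is a minimal sketch of
the user-space side of af_alg, using the existing "hash" algorithm type
(as noted above, a compression type does not exist yet, so the names below
would differ for one; error handling omitted):

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_alg.h>

int main(void)
{
	/* the "transform" socket names the algorithm... */
	int tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
	struct sockaddr_alg sa = {
		.salg_family = AF_ALG,
		.salg_type   = "hash",
		.salg_name   = "sha256",
	};
	bind(tfm, (struct sockaddr *)&sa, sizeof(sa));

	/* ...and each accept() yields an operation fd */
	int op = accept(tfm, NULL, 0);
	write(op, "abc", 3);			/* feed data in */

	unsigned char digest[32];
	read(op, digest, sizeof(digest));	/* read the result out */

	for (unsigned int i = 0; i < sizeof(digest); i++)
		printf("%02x", digest[i]);
	printf("\n");

	close(op);
	close(tfm);
	return 0;
}

The kernel picks the highest-priority sha256 implementation behind this
socket, which is exactly the implementation-agnostic property described
above.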

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[Bug 205283] BUG: KASAN: global-out-of-bounds in _copy_to_iter+0x3d4/0x5a8

2019-12-19 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=205283

--- Comment #4 from Christophe Leroy (christophe.le...@c-s.fr) ---
Can you apply https://patchwork.ozlabs.org/patch/1213028/ and select
CONFIG_KASAN_VMALLOC?

-- 
You are receiving this mail because:
You are watching someone on the CC list of the bug.

Re: [PATCH 03/18] powerpc: Add PREFIXED SRR1 bit for future ISA version

2019-12-19 Thread Jordan Niethe
On Wed, Dec 18, 2019 at 7:23 PM Daniel Axtens  wrote:
>
> Jordan Niethe  writes:
>
> > Add the bit definition for exceptions caused by prefixed instructions.
> >
> > Signed-off-by: Jordan Niethe 
> > ---
> >  arch/powerpc/include/asm/reg.h | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> > index 6f9fcc3d4c82..0a6d39fb4769 100644
> > --- a/arch/powerpc/include/asm/reg.h
> > +++ b/arch/powerpc/include/asm/reg.h
> > @@ -778,6 +778,7 @@
> >
> >  #define   SRR1_MCE_MCP   0x0008 /* Machine check signal 
> > caused interrupt */
> >  #define   SRR1_BOUNDARY  0x1000 /* Prefixed instruction 
> > crosses 64-byte boundary */
> > +#define   SRR1_PREFIXED  0x2000 /* Exception caused by 
> > prefixed instruction */
>
> You could probably squash this with the previous patch, and maybe the
> next patch too.
>
> Regards,
> Daniel
>
> >
> >  #define SPRN_HSRR0   0x13A   /* Save/Restore Register 0 */
> >  #define SPRN_HSRR1   0x13B   /* Save/Restore Register 1 */
> > --
> > 2.20.1
Thanks, good idea.


Re: [PATCH] powerpc/book3s64: Fix error handling in mm_iommu_do_alloc()

2019-12-19 Thread Alexey Kardashevskiy



On 11/12/2019 21:42, Jan Kara wrote:
> The last jump to free_exit in mm_iommu_do_alloc() happens after page
> pointers in struct mm_iommu_table_group_mem_t were already converted to
> physical addresses. Thus calling put_page() on these physical addresses
> will likely crash. Convert physical addresses back to page pointers
> during the error cleanup.
> 
> Signed-off-by: Jan Kara 
> ---
>  arch/powerpc/mm/book3s64/iommu_api.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
>  Beware, this is completely untested, spotted just by code audit.
> 
> diff --git a/arch/powerpc/mm/book3s64/iommu_api.c 
> b/arch/powerpc/mm/book3s64/iommu_api.c
> index 56cc84520577..06c403381c9c 100644
> --- a/arch/powerpc/mm/book3s64/iommu_api.c
> +++ b/arch/powerpc/mm/book3s64/iommu_api.c
> @@ -154,7 +154,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
> unsigned long ua,
>  (mem2->entries << PAGE_SHIFT)))) {
>   ret = -EINVAL;
>   mutex_unlock(&mem_list_mutex);
> - goto free_exit;
> + goto convert_exit;
>   }
>   }
>  
> @@ -166,6 +166,9 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
> unsigned long ua,
>  
>   return 0;
>  
> +convert_exit:
> + for (i = 0; i < pinned; i++)
> + mem->hpages[i] = pfn_to_page(mem->hpas[i] >> PAGE_SHIFT);


Good find. Although doing it where you added "goto convert_exit" seems
cleaner imho. Thanks,
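
For illustration, the in-place variant might look like this at the error
site (a sketch reusing the patch's locals, untested):

	ret = -EINVAL;
	mutex_unlock(&mem_list_mutex);
	/* undo the page-to-physical conversion right here, then reuse
	 * the existing free_exit cleanup */
	for (i = 0; i < pinned; i++)
		mem->hpages[i] = pfn_to_page(mem->hpas[i] >> PAGE_SHIFT);
	goto free_exit;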


>  free_exit:
>   /* free the reference taken */
>   for (i = 0; i < pinned; i++)
> 

-- 
Alexey


Re: [PATCH 05/18] powerpc sstep: Prepare to support prefixed instructions

2019-12-19 Thread Jordan Niethe
On Wed, Dec 18, 2019 at 7:35 PM Daniel Axtens  wrote:
>
> Jordan Niethe  writes:
>
> > Currently all instructions are a single word long. A future ISA version
> > will include prefixed instructions which have a double word length. The
> > functions used for analysing and emulating instructions need to be
> > modified so that they can handle these new instruction types.
> >
> > A prefixed instruction is a word prefix followed by a word suffix. All
> > prefixes uniquely have the primary op-code 1. Suffixes may be valid word
> > instructions or instructions that only exist as suffixes.
> >
> > In handling prefixed instructions it will be convenient to treat the
> > suffix and prefix as separate words. To facilitate this modify
> > analyse_instr() and emulate_step() to take a suffix as a
> > parameter. For word instructions it does not matter what is passed in
> > here - it will be ignored.
> >
> > We also define a new flag, PREFIXED, to be used in instruction_op:type.
> > This flag will indicate when emulating an analysed instruction if the
> > NIP should be advanced by word length or double word length.
> >
> > The callers of analyse_instr() and emulate_step() will need their own
> > changes to be able to support prefixed instructions. For now modify them
> > to pass in 0 as a suffix.
> >
> > Note that at this point no prefixed instructions are emulated or
> > analysed - this is just making it possible to do so.
> >
> > Signed-off-by: Jordan Niethe 
> > ---
> >  arch/powerpc/include/asm/ppc-opcode.h |  3 +++
> >  arch/powerpc/include/asm/sstep.h  |  8 +--
> >  arch/powerpc/include/asm/uaccess.h| 30 +++
> >  arch/powerpc/kernel/align.c   |  2 +-
> >  arch/powerpc/kernel/hw_breakpoint.c   |  4 ++--
> >  arch/powerpc/kernel/kprobes.c |  2 +-
> >  arch/powerpc/kernel/mce_power.c   |  2 +-
> >  arch/powerpc/kernel/optprobes.c   |  2 +-
> >  arch/powerpc/kernel/uprobes.c |  2 +-
> >  arch/powerpc/kvm/emulate_loadstore.c  |  2 +-
> >  arch/powerpc/lib/sstep.c  | 12 ++-
> >  arch/powerpc/lib/test_emulate_step.c  | 30 +--
> >  arch/powerpc/xmon/xmon.c  |  4 ++--
> >  13 files changed, 71 insertions(+), 32 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
> > b/arch/powerpc/include/asm/ppc-opcode.h
> > index c1df75edde44..a1dfa4bdd22f 100644
> > --- a/arch/powerpc/include/asm/ppc-opcode.h
> > +++ b/arch/powerpc/include/asm/ppc-opcode.h
> > @@ -377,6 +377,9 @@
> >  #define PPC_INST_VCMPEQUD0x10c7
> >  #define PPC_INST_VCMPEQUB0x1006
> >
> > +/* macro to check if a word is a prefix */
> > +#define IS_PREFIX(x) (((x) >> 26) == 1)
> > +
> >  /* macros to insert fields into opcodes */
> >  #define ___PPC_RA(a) (((a) & 0x1f) << 16)
> >  #define ___PPC_RB(b) (((b) & 0x1f) << 11)
> > diff --git a/arch/powerpc/include/asm/sstep.h 
> > b/arch/powerpc/include/asm/sstep.h
> > index 769f055509c9..6d4cb602e231 100644
> > --- a/arch/powerpc/include/asm/sstep.h
> > +++ b/arch/powerpc/include/asm/sstep.h
> > @@ -89,6 +89,9 @@ enum instruction_type {
> >  #define VSX_LDLEFT   4   /* load VSX register from left */
> >  #define VSX_CHECK_VEC8   /* check MSR_VEC not MSR_VSX for reg 
> > >= 32 */
> >
> > +/* Prefixed flag, ORed in with type */
> > +#define PREFIXED 0x800
> > +
> >  /* Size field in type word */
> >  #define SIZE(n)  ((n) << 12)
> >  #define GETSIZE(w)   ((w) >> 12)
> > @@ -132,7 +135,7 @@ union vsx_reg {
> >   * otherwise.
> >   */
> >  extern int analyse_instr(struct instruction_op *op, const struct pt_regs 
> > *regs,
> > -  unsigned int instr);
> > +  unsigned int instr, unsigned int sufx);
> >
> >  /*
> >   * Emulate an instruction that can be executed just by updating
> > @@ -149,7 +152,8 @@ void emulate_update_regs(struct pt_regs *reg, struct 
> > instruction_op *op);
> >   * 0 if it could not be emulated, or -1 for an instruction that
> >   * should not be emulated (rfid, mtmsrd clearing MSR_RI, etc.).
> >   */
> > -extern int emulate_step(struct pt_regs *regs, unsigned int instr);
> > +extern int emulate_step(struct pt_regs *regs, unsigned int instr,
> > + unsigned int sufx);
> >
> >  /*
> >   * Emulate a load or store instruction by reading/writing the
> > diff --git a/arch/powerpc/include/asm/uaccess.h 
> > b/arch/powerpc/include/asm/uaccess.h
> > index 15002b51ff18..bc585399e0c7 100644
> > --- a/arch/powerpc/include/asm/uaccess.h
> > +++ b/arch/powerpc/include/asm/uaccess.h
> > @@ -423,4 +423,34 @@ extern long __copy_from_user_flushcache(void *dst, 
> > const void __user *src,
> >  extern void memcpy_page_flushcache(char *to, struct page *page, size_t 
> > offset,
> >  size_t len);
> >
> > +/*
> > + * When reading an instruction iff it is a prefix, the suffix needs to be 
> > also
> > + * loaded.
> > + */
> > 

Re: [PATCH 05/18] powerpc sstep: Prepare to support prefixed instructions

2019-12-19 Thread Jordan Niethe
On Thu, Dec 19, 2019 at 1:15 AM Daniel Axtens  wrote:
>
> Jordan Niethe  writes:
>
> > Currently all instructions are a single word long. A future ISA version
> > will include prefixed instructions which have a double word length. The
> > functions used for analysing and emulating instructions need to be
> > modified so that they can handle these new instruction types.
> >
> > A prefixed instruction is a word prefix followed by a word suffix. All
> > prefixes uniquely have the primary op-code 1. Suffixes may be valid word
> > instructions or instructions that only exist as suffixes.
> >
> > In handling prefixed instructions it will be convenient to treat the
> > suffix and prefix as separate words. To facilitate this modify
> > analyse_instr() and emulate_step() to take a suffix as a
> > parameter. For word instructions it does not matter what is passed in
> > here - it will be ignored.
> >
> > We also define a new flag, PREFIXED, to be used in instruction_op:type.
> > This flag will indicate when emulating an analysed instruction if the
> > NIP should be advanced by word length or double word length.
> >
> > The callers of analyse_instr() and emulate_step() will need their own
> > changes to be able to support prefixed instructions. For now modify them
> > to pass in 0 as a suffix.
> >
> > Note that at this point no prefixed instructions are emulated or
> > analysed - this is just making it possible to do so.
> >
> > Signed-off-by: Jordan Niethe 
> > ---
> >  arch/powerpc/include/asm/ppc-opcode.h |  3 +++
> >  arch/powerpc/include/asm/sstep.h  |  8 +--
> >  arch/powerpc/include/asm/uaccess.h| 30 +++
> >  arch/powerpc/kernel/align.c   |  2 +-
> >  arch/powerpc/kernel/hw_breakpoint.c   |  4 ++--
> >  arch/powerpc/kernel/kprobes.c |  2 +-
> >  arch/powerpc/kernel/mce_power.c   |  2 +-
> >  arch/powerpc/kernel/optprobes.c   |  2 +-
> >  arch/powerpc/kernel/uprobes.c |  2 +-
> >  arch/powerpc/kvm/emulate_loadstore.c  |  2 +-
> >  arch/powerpc/lib/sstep.c  | 12 ++-
> >  arch/powerpc/lib/test_emulate_step.c  | 30 +--
> >  arch/powerpc/xmon/xmon.c  |  4 ++--
> >  13 files changed, 71 insertions(+), 32 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
> > b/arch/powerpc/include/asm/ppc-opcode.h
> > index c1df75edde44..a1dfa4bdd22f 100644
> > --- a/arch/powerpc/include/asm/ppc-opcode.h
> > +++ b/arch/powerpc/include/asm/ppc-opcode.h
> > @@ -377,6 +377,9 @@
> >  #define PPC_INST_VCMPEQUD0x10c7
> >  #define PPC_INST_VCMPEQUB0x1006
> >
> > +/* macro to check if a word is a prefix */
> > +#define IS_PREFIX(x) (((x) >> 26) == 1)
> > +
> >  /* macros to insert fields into opcodes */
> >  #define ___PPC_RA(a) (((a) & 0x1f) << 16)
> >  #define ___PPC_RB(b) (((b) & 0x1f) << 11)
> > diff --git a/arch/powerpc/include/asm/sstep.h 
> > b/arch/powerpc/include/asm/sstep.h
> > index 769f055509c9..6d4cb602e231 100644
> > --- a/arch/powerpc/include/asm/sstep.h
> > +++ b/arch/powerpc/include/asm/sstep.h
> > @@ -89,6 +89,9 @@ enum instruction_type {
> >  #define VSX_LDLEFT   4   /* load VSX register from left */
> >  #define VSX_CHECK_VEC8   /* check MSR_VEC not MSR_VSX for reg 
> > >= 32 */
> >
> > +/* Prefixed flag, ORed in with type */
> > +#define PREFIXED 0x800
> > +
> >  /* Size field in type word */
> >  #define SIZE(n)  ((n) << 12)
> >  #define GETSIZE(w)   ((w) >> 12)
> > @@ -132,7 +135,7 @@ union vsx_reg {
> >   * otherwise.
> >   */
> >  extern int analyse_instr(struct instruction_op *op, const struct pt_regs 
> > *regs,
> > -  unsigned int instr);
> > +  unsigned int instr, unsigned int sufx);
> >
>
> I'm not saying this is necessarily better, but did you consider:
>
>  - making instr 64 bits and using masking and shifting macros to get the
>prefix and suffix?
>
>  - defining an instruction type/struct/union/whatever that contains both
>halves in one object?
>
> I'm happy to be told that it ends up being way, way uglier/worse/etc,
> but I just thought I'd ask.
>
> Regards,
> Daniel

It is a good question, and something I thought about; I am not completely
confident that this approach is the best. Basically, what I ended up
thinking was that the prefixed instructions were a bit of a special case,
and by doing it like this the normal word instructions would just carry
on the same as before.

I can see this is a pretty flimsy reason, so I am happy for suggestions as
to what would end up being clearer.
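
For what it's worth, a hedged sketch of Daniel's second option (a single
object carrying both halves; every name here is invented for
illustration):

struct ppc_inst {
	u32 prefix;	/* primary opcode 1 when this is a prefixed inst */
	u32 suffix;	/* ignored for plain word instructions */
};

static inline bool ppc_inst_prefixed(struct ppc_inst x)
{
	return (x.prefix >> 26) == 1;	/* same test as IS_PREFIX() */
}

static inline int ppc_inst_len(struct ppc_inst x)
{
	return ppc_inst_prefixed(x) ? 8 : 4;	/* bytes to advance the NIP */
}

Callers would then pass one struct ppc_inst instead of an (instr, sufx)
pair, at the cost of touching every call site.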


>
> >  /*
> >   * Emulate an instruction that can be executed just by updating
> > @@ -149,7 +152,8 @@ void emulate_update_regs(struct pt_regs *reg, struct 
> > instruction_op *op);
> >   * 0 if it could not be emulated, or -1 for an instruction that
> >   * should not be emulated (rfid, mtmsrd clearing MSR_RI, etc.).
> >   */
> > -extern int emulate_step(struct pt_regs 

Re: [PATCH 05/18] powerpc sstep: Prepare to support prefixed instructions

2019-12-19 Thread Christophe Leroy




On 20/12/2019 at 06:11, Jordan Niethe wrote:

On Wed, Dec 18, 2019 at 7:35 PM Daniel Axtens  wrote:


Jordan Niethe  writes:


Currently all instructions are a single word long. A future ISA version
will include prefixed instructions which have a double word length. The
functions used for analysing and emulating instructions need to be
modified so that they can handle these new instruction types.

A prefixed instruction is a word prefix followed by a word suffix. All
prefixes uniquely have the primary op-code 1. Suffixes may be valid word
instructions or instructions that only exist as suffixes.

In handling prefixed instructions it will be convenient to treat the
suffix and prefix as separate words. To facilitate this modify
analyse_instr() and emulate_step() to take a suffix as a
parameter. For word instructions it does not matter what is passed in
here - it will be ignored.

We also define a new flag, PREFIXED, to be used in instruction_op:type.
This flag will indicate when emulating an analysed instruction if the
NIP should be advanced by word length or double word length.

The callers of analyse_instr() and emulate_step() will need their own
changes to be able to support prefixed instructions. For now modify them
to pass in 0 as a suffix.

Note that at this point no prefixed instructions are emulated or
analysed - this is just making it possible to do so.

Signed-off-by: Jordan Niethe 
---
  arch/powerpc/include/asm/ppc-opcode.h |  3 +++
  arch/powerpc/include/asm/sstep.h  |  8 +--
  arch/powerpc/include/asm/uaccess.h| 30 +++
  arch/powerpc/kernel/align.c   |  2 +-
  arch/powerpc/kernel/hw_breakpoint.c   |  4 ++--
  arch/powerpc/kernel/kprobes.c |  2 +-
  arch/powerpc/kernel/mce_power.c   |  2 +-
  arch/powerpc/kernel/optprobes.c   |  2 +-
  arch/powerpc/kernel/uprobes.c |  2 +-
  arch/powerpc/kvm/emulate_loadstore.c  |  2 +-
  arch/powerpc/lib/sstep.c  | 12 ++-
  arch/powerpc/lib/test_emulate_step.c  | 30 +--
  arch/powerpc/xmon/xmon.c  |  4 ++--
  13 files changed, 71 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
b/arch/powerpc/include/asm/ppc-opcode.h
index c1df75edde44..a1dfa4bdd22f 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -377,6 +377,9 @@
  #define PPC_INST_VCMPEQUD0x10c7
  #define PPC_INST_VCMPEQUB0x1006

+/* macro to check if a word is a prefix */
+#define IS_PREFIX(x) (((x) >> 26) == 1)
+
  /* macros to insert fields into opcodes */
  #define ___PPC_RA(a) (((a) & 0x1f) << 16)
  #define ___PPC_RB(b) (((b) & 0x1f) << 11)
diff --git a/arch/powerpc/include/asm/sstep.h b/arch/powerpc/include/asm/sstep.h
index 769f055509c9..6d4cb602e231 100644
--- a/arch/powerpc/include/asm/sstep.h
+++ b/arch/powerpc/include/asm/sstep.h
@@ -89,6 +89,9 @@ enum instruction_type {
  #define VSX_LDLEFT   4   /* load VSX register from left */
  #define VSX_CHECK_VEC8   /* check MSR_VEC not MSR_VSX for reg >= 
32 */

+/* Prefixed flag, ORed in with type */
+#define PREFIXED 0x800
+
  /* Size field in type word */
  #define SIZE(n)  ((n) << 12)
  #define GETSIZE(w)   ((w) >> 12)
@@ -132,7 +135,7 @@ union vsx_reg {
   * otherwise.
   */
  extern int analyse_instr(struct instruction_op *op, const struct pt_regs 
*regs,
-  unsigned int instr);
+  unsigned int instr, unsigned int sufx);

  /*
   * Emulate an instruction that can be executed just by updating
@@ -149,7 +152,8 @@ void emulate_update_regs(struct pt_regs *reg, struct 
instruction_op *op);
   * 0 if it could not be emulated, or -1 for an instruction that
   * should not be emulated (rfid, mtmsrd clearing MSR_RI, etc.).
   */
-extern int emulate_step(struct pt_regs *regs, unsigned int instr);
+extern int emulate_step(struct pt_regs *regs, unsigned int instr,
+ unsigned int sufx);

  /*
   * Emulate a load or store instruction by reading/writing the
diff --git a/arch/powerpc/include/asm/uaccess.h 
b/arch/powerpc/include/asm/uaccess.h
index 15002b51ff18..bc585399e0c7 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -423,4 +423,34 @@ extern long __copy_from_user_flushcache(void *dst, const 
void __user *src,
  extern void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
  size_t len);

+/*
+ * When reading an instruction iff it is a prefix, the suffix needs to be also
+ * loaded.
+ */
+#define __get_user_instr(x, y, ptr)  \
+({   \
+ long __gui_ret = 0; \
+ y = 0;  \
+ __gui_ret = __get_user(x, ptr); \
+ if (!__gui_ret) {   \
+ if 

Re: [PATCH v4 4/9] drm/i915/perf: open access for CAP_SYS_PERFMON privileged process

2019-12-19 Thread Lionel Landwerlin

On 18/12/2019 11:27, Alexey Budankov wrote:

Open access to i915_perf monitoring for CAP_SYS_PERFMON privileged
processes. For backward compatibility reasons access to i915_perf
subsystem remains open for CAP_SYS_ADMIN privileged processes but
CAP_SYS_ADMIN usage for secure i915_perf monitoring is discouraged
with respect to CAP_SYS_PERFMON capability.

Signed-off-by: Alexey Budankov 


Acked-by: Lionel Landwerlin 


---
  drivers/gpu/drm/i915/i915_perf.c | 13 ++---
  1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index e42b86827d6b..e2697f8d04de 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -2748,10 +2748,10 @@ i915_perf_open_ioctl_locked(struct drm_i915_private 
*dev_priv,
/* Similar to perf's kernel.perf_paranoid_cpu sysctl option
 * we check a dev.i915.perf_stream_paranoid sysctl option
 * to determine if it's ok to access system wide OA counters
-* without CAP_SYS_ADMIN privileges.
+* without CAP_SYS_PERFMON or CAP_SYS_ADMIN privileges.
 */
if (privileged_op &&
-   i915_perf_stream_paranoid && !capable(CAP_SYS_ADMIN)) {
+   i915_perf_stream_paranoid && !perfmon_capable()) {
DRM_DEBUG("Insufficient privileges to open system-wide i915 perf 
stream\n");
ret = -EACCES;
goto err_ctx;
@@ -2939,9 +2939,8 @@ static int read_properties_unlocked(struct 
drm_i915_private *dev_priv,
} else
oa_freq_hz = 0;
  
-			if (oa_freq_hz > i915_oa_max_sample_rate &&

-   !capable(CAP_SYS_ADMIN)) {
-   DRM_DEBUG("OA exponent would exceed the max sampling 
frequency (sysctl dev.i915.oa_max_sample_rate) %uHz without root privileges\n",
+   if (oa_freq_hz > i915_oa_max_sample_rate && 
!perfmon_capable()) {
+   DRM_DEBUG("OA exponent would exceed the max sampling 
frequency (sysctl dev.i915.oa_max_sample_rate) %uHz without CAP_SYS_PERFMON or 
CAP_SYS_ADMIN privileges\n",
  i915_oa_max_sample_rate);
return -EACCES;
}
@@ -3328,7 +3327,7 @@ int i915_perf_add_config_ioctl(struct drm_device *dev, 
void *data,
return -EINVAL;
}
  
-	if (i915_perf_stream_paranoid && !capable(CAP_SYS_ADMIN)) {

+   if (i915_perf_stream_paranoid && !perfmon_capable()) {
DRM_DEBUG("Insufficient privileges to add i915 OA config\n");
return -EACCES;
}
@@ -3474,7 +3473,7 @@ int i915_perf_remove_config_ioctl(struct drm_device *dev, 
void *data,
return -ENOTSUPP;
}
  
-	if (i915_perf_stream_paranoid && !capable(CAP_SYS_ADMIN)) {

+   if (i915_perf_stream_paranoid && !perfmon_capable()) {
DRM_DEBUG("Insufficient privileges to remove i915 OA config\n");
return -EACCES;
}





Re: [PATCH v4 4/4] powerpc: Book3S 64-bit "heavyweight" KASAN support

2019-12-19 Thread Daniel Axtens
Christophe Leroy  writes:

> On 12/19/2019 12:36 AM, Daniel Axtens wrote:
>> KASAN support on Book3S is a bit tricky to get right:
>> 
>>   - It would be good to support inline instrumentation so as to be able to
>> catch stack issues that cannot be caught with outline mode.
>> 
>>   - Inline instrumentation requires a fixed offset.
>> 
>>   - Book3S runs code in real mode after booting. Most notably a lot of KVM
>> runs in real mode, and it would be good to be able to instrument it.
>> 
>>   - Because code runs in real mode after boot, the offset has to point to
>> valid memory both in and out of real mode.
>> 
>>  [ppc64 mm note: The kernel installs a linear mapping at effective
>>  address c000... onward. This is a one-to-one mapping with physical
>>  memory from ... onward. Because of how memory accesses work on
>>  powerpc 64-bit Book3S, a kernel pointer in the linear map accesses the
>>  same memory both with translations on (accessing as an 'effective
>>  address'), and with translations off (accessing as a 'real
>>  address'). This works in both guests and the hypervisor. For more
>>  details, see s5.7 of Book III of version 3 of the ISA, in particular
>>  the Storage Control Overview, s5.7.3, and s5.7.5 - noting that this
>>  KASAN implementation currently only supports Radix.]
>> 
>> One approach is just to give up on inline instrumentation. This way all
>> checks can be delayed until after everything is set up correctly, and the
>> address-to-shadow calculations can be overridden. However, the features and
>> speed boost provided by inline instrumentation are worth trying to do
>> better.
>> 
>> If _at compile time_ it is known how much contiguous physical memory a
>> system has, the top 1/8th of the first block of physical memory can be set
>> aside for the shadow. This is a big hammer and comes with 3 big
>> consequences:
>> 
>>   - there's no nice way to handle physically discontiguous memory, so only
>> the first physical memory block can be used.
>> 
>>   - kernels will simply fail to boot on machines with less memory than
>> specified when compiling.
>> 
>>   - kernels running on machines with more memory than specified when
>> compiling will simply ignore the extra memory.
>> 
>> Implement and document KASAN this way. The current implementation is Radix
>> only.
>> 
>> Despite the limitations, it can still find bugs,
>> e.g. http://patchwork.ozlabs.org/patch/1103775/
>> 
>> At the moment, this physical memory limit must be set _even for outline
>> mode_. This may be changed in a later series - a different implementation
>> could be added for outline mode that dynamically allocates shadow at a
>> fixed offset. For example, see https://patchwork.ozlabs.org/patch/795211/
>> 
>> Suggested-by: Michael Ellerman 
>> Cc: Balbir Singh  # ppc64 out-of-line radix version
>> Cc: Christophe Leroy  # ppc32 version
>> Signed-off-by: Daniel Axtens 
>> 
>> ---
>> Changes since v3:
>>   - Address further feedback from Christophe.
>>   - Drop changes to stack walking, it looks like the issue I observed is
>> related to that particular stack, not stack-walking generally.
>> 
>> Changes since v2:
>> 
>>   - Address feedback from Christophe around cleanups and docs.
>>   - Address feedback from Balbir: at this point I don't have a good solution
>> for the issues you identify around the limitations of the inline 
>> implementation
>> but I think that it's worth trying to get the stack instrumentation 
>> support.
>> I'm happy to have an alternative and more flexible outline mode - I had
>> envisioned this would be called 'lightweight' mode as it imposes fewer 
>> restrictions.
>> I've linked to your implementation. I think it's best to add it in a 
>> follow-up series.
>>   - Made the default PHYS_MEM_SIZE_FOR_KASAN value 1024MB. I think most 
>> people have
>> guests with at least that much memory in the Radix 64s case so it's a 
>> much
>> saner default - it means that if you just turn on KASAN without reading 
>> the
>> docs you're much more likely to have a bootable kernel, which you will 
>> never
>> have if the value is set to zero! I'm happy to bikeshed the value if we 
>> want.
>> 
>> Changes since v1:
>>   - Landed kasan vmalloc support upstream
>>   - Lots of feedback from Christophe.
>> 
>> Changes since the rfc:
>> 
>>   - Boots real and virtual hardware, kvm works.
>> 
>>   - disabled reporting when we're checking the stack for exception
>> frames. The behaviour isn't wrong, just incompatible with KASAN.
>> 
>>   - Documentation!
>> 
>>   - Dropped old module stuff in favour of KASAN_VMALLOC.
>> 
>> The bugs with ftrace and kuap were due to kernel bloat pushing
>> prom_init calls to be done via the plt. Because we did not have
>> a relocatable kernel, and they are done very early, this caused
>> everything to explode. Compile with CONFIG_RELOCATABLE!
>> ---
>>   

Re: [PATCH 08/10] crypto/NX: Add NX GZIP user space API

2019-12-19 Thread Haren Myneni
On Tue, 2019-12-17 at 17:33 +0800, Herbert Xu wrote:
> On Sun, Dec 15, 2019 at 05:05:19AM -0800, Haren Myneni wrote:
> > 
> > On power9, userspace can send GZIP compression requests directly to NX
> > once the kernel establishes an NX channel / window. This patch provides
> > GZIP engine access to user space via the /dev/crypto/nx-gzip device
> > node with open, VAS_TX_WIN_OPEN ioctl, mmap and close operations.
> > 
> > Each window corresponds to a file descriptor, and an application can
> > open multiple windows. After a window is opened, the application uses
> > the mmap() system call to map the hardware address of the engine's
> > request queue into its virtual address space.
> > 
> > The application can then submit one or more requests to the engine by
> > using the copy/paste instructions, pasting the CRBs to the virtual
> > address (aka paste_address) returned by mmap().
> > 
> > Signed-off-by: Sukadev Bhattiprolu 
> > Signed-off-by: Haren Myneni 
> > ---
> >  drivers/crypto/nx/Makefile|   2 +-
> >  drivers/crypto/nx/nx-842-powernv.h|   2 +
> >  drivers/crypto/nx/nx-commom-powernv.c |  21 ++-
> >  drivers/crypto/nx/nx-gzip-powernv.c   | 282 
> > ++
> >  4 files changed, 304 insertions(+), 3 deletions(-)
> >  create mode 100644 drivers/crypto/nx/nx-gzip-powernv.c
> 
> We already have a kernel compress API which could be exposed
> to user-space through af_alg.  If every driver created their
> own user-space API it would be unmanageable.

Thanks. 

Virtual Accelerator Switchboard (VAS) can provide support for different
accelerators. Right now only NX is used, but it is possible to extend it
to others in the future, or to different functionalities such as fast
thread wakeup (a VAS feature) with VAS windows.

So we are looking at a common VAS API for any of its accelerators: open
a window / channel (open() and ioctl() calls), set up the communication
by mapping the NX address (mmap()), and close the window. Then user
space communicates with the accelerator directly, without kernel
involvement. Specific drivers should set window attributes such as how
many requests can be sent at the same time, etc. All other interfaces
should be the same for any accelerator.

Also, since user space sends requests directly, we should restrict
malicious users from overloading NX (a security issue), allowing the
sysadmin to restrict /dev/crypto/nx-gzip usage.


As you suggested, the SW crypto API (af_alg) could be used just for NX
compression, i.e. an API based on the accelerator's functionality. It
is a socket-based API using the AF_ALG socket family. But is there a way
for the sysadmin to restrict usage from user space? We would need just a
few functions in struct proto.

static const struct proto_ops nxgzip_ops = {
	.family  = PF_ALG,
	.ioctl   = nxgzip_ioctl,
	.mmap    = nxgzip_mmap,
	.release = nxgzip_release,
};

Thanks
Haren


> 
> Cheers,




Re: [PATCH v4 4/4] powerpc: Book3S 64-bit "heavyweight" KASAN support

2019-12-19 Thread Christophe Leroy




On 12/19/2019 10:05 AM, Christophe Leroy wrote:



On 19/12/2019 at 10:50, Daniel Axtens wrote:

Christophe Leroy  writes:


On 12/19/2019 12:36 AM, Daniel Axtens wrote:

KASAN support on Book3S is a bit tricky to get right:

   - It would be good to support inline instrumentation so as to be 
able to

 catch stack issues that cannot be caught with outline mode.

   - Inline instrumentation requires a fixed offset.

   - Book3S runs code in real mode after booting. Most notably a lot 
of KVM
 runs in real mode, and it would be good to be able to 
instrument it.


   - Because code runs in real mode after boot, the offset has to 
point to

 valid memory both in and out of real mode.

  [ppc64 mm note: The kernel installs a linear mapping at effective
  address c000... onward. This is a one-to-one mapping with 
physical
  memory from ... onward. Because of how memory accesses 
work on
  powerpc 64-bit Book3S, a kernel pointer in the linear map 
accesses the

  same memory both with translations on (accessing as an 'effective
  address'), and with translations off (accessing as a 'real
  address'). This works in both guests and the hypervisor. For more
  details, see s5.7 of Book III of version 3 of the ISA, in 
particular
  the Storage Control Overview, s5.7.3, and s5.7.5 - noting that 
this

  KASAN implementation currently only supports Radix.]

One approach is just to give up on inline instrumentation. This way all
checks can be delayed until after everything is set up correctly, 
and the
address-to-shadow calculations can be overridden. However, the 
features and

speed boost provided by inline instrumentation are worth trying to do
better.

If _at compile time_ it is known how much contiguous physical memory a
system has, the top 1/8th of the first block of physical memory can 
be set

aside for the shadow. This is a big hammer and comes with 3 big
consequences:

   - there's no nice way to handle physically discontiguous memory, 
so only

 the first physical memory block can be used.

   - kernels will simply fail to boot on machines with less memory than
 specified when compiling.

   - kernels running on machines with more memory than specified when
 compiling will simply ignore the extra memory.

Implement and document KASAN this way. The current implementation is 
Radix

only.

Despite the limitations, it can still find bugs,
e.g. http://patchwork.ozlabs.org/patch/1103775/

At the moment, this physical memory limit must be set _even for outline
mode_. This may be changed in a later series - a different 
implementation

could be added for outline mode that dynamically allocates shadow at a
fixed offset. For example, see 
https://patchwork.ozlabs.org/patch/795211/


Suggested-by: Michael Ellerman 
Cc: Balbir Singh  # ppc64 out-of-line radix 
version

Cc: Christophe Leroy  # ppc32 version
Signed-off-by: Daniel Axtens 

---
Changes since v3:
   - Address further feedback from Christophe.
   - Drop changes to stack walking, it looks like the issue I 
observed is

 related to that particular stack, not stack-walking generally.

Changes since v2:

   - Address feedback from Christophe around cleanups and docs.
   - Address feedback from Balbir: at this point I don't have a good 
solution
 for the issues you identify around the limitations of the 
inline implementation
 but I think that it's worth trying to get the stack 
instrumentation support.
 I'm happy to have an alternative and more flexible outline mode 
- I had
 envisioned this would be called 'lightweight' mode as it imposes 
fewer restrictions.
 I've linked to your implementation. I think it's best to add it 
in a follow-up series.
   - Made the default PHYS_MEM_SIZE_FOR_KASAN value 1024MB. I think 
most people have
 guests with at least that much memory in the Radix 64s case so 
it's a much
 saner default - it means that if you just turn on KASAN without 
reading the
 docs you're much more likely to have a bootable kernel, which 
you will never
 have if the value is set to zero! I'm happy to bikeshed the 
value if we want.


Changes since v1:
   - Landed kasan vmalloc support upstream
   - Lots of feedback from Christophe.

Changes since the rfc:

   - Boots real and virtual hardware, kvm works.

   - disabled reporting when we're checking the stack for exception
 frames. The behaviour isn't wrong, just incompatible with KASAN.

   - Documentation!

   - Dropped old module stuff in favour of KASAN_VMALLOC.

The bugs with ftrace and kuap were due to kernel bloat pushing
prom_init calls to be done via the plt. Because we did not have
a relocatable kernel, and they are done very early, this caused
everything to explode. Compile with CONFIG_RELOCATABLE!
---
   Documentation/dev-tools/kasan.rst    |   8 +-
   Documentation/powerpc/kasan.txt  | 112 
++-

   arch/powerpc/Kconfig

Re: [Qemu-ppc] pseries on qemu-system-ppc64le crashes in doorbell_core_ipi()

2019-12-19 Thread Jason A. Donenfeld

Hi folks,

I'm actually still experiencing this sporadically in the WireGuard test 
suite, which you can see being run on https://build.wireguard.com/ . 
About 50% of the time the powerpc64 build will fail at a place like this:


[   65.147823] Oops: Exception in kernel mode, sig: 4 [#1]
[   65.149198] LE PAGE_SIZE=4K MMU=Hash PREEMPT SMP NR_CPUS=4 NUMA pSeries
[   65.149595] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc1+ #1
[   65.149745] NIP:  c000 LR: c007eda0 CTR: 
c007ed80

[   65.149934] REGS: c0a47970 TRAP: 0700   Not tainted  (5.5.0-rc1+)
[   65.150032] MSR:  8288b033  
CR: 48008288  XER: 

[   65.150352] CFAR: c00332bc IRQMASK: 1
[   65.150352] GPR00: c0036508 c0a47c00 c0a4c100 
0001
[   65.150352] GPR04: c0a50998  c0a50908 
0f509000
[   65.150352] GPR08: 2800   
cff24f00
[   65.150352] GPR12: c007ed80 c0ad9000  

[   65.150352] GPR16: 008c9190 008c94a8 008c92f8 
008c98b0
[   65.150352] GPR20: 008f2f88 fffd 0014 
00e6c100
[   65.150352] GPR24: 00e6c100 0001  
c0a50998
[   65.150352] GPR28: c0a9e280 c0a50aa4 0002 


[   65.151591] NIP [c000] doorbell_try_core_ipi+0xd0/0xf0
[   65.151750] LR [c007eda0] smp_pseries_cause_ipi+0x20/0x70
[   65.151913] Call Trace:
[   65.152109] [c0a47c00] [c00c7c9c] 
_nohz_idle_balance+0xbc/0x300 (unreliable)
[   65.152370] [c0a47c30] [c0036508] 
smp_send_reschedule+0x98/0xb0

[   65.152711] [c0a47c50] [c00c1634] kick_ilb+0x114/0x140
[   65.152962] [c0a47ca0] [c00c86d8] 
newidle_balance+0x4e8/0x500
[   65.153213] [c0a47d20] [c00c8788] 
pick_next_task_fair+0x48/0x3a0

[   65.153424] [c0a47d80] [c0466620] __schedule+0xf0/0x430
[   65.153612] [c0a47de0] [c0466b04] schedule_idle+0x34/0x70
[   65.153786] [c0a47e10] [c00c0bc8] do_idle+0x1a8/0x220
[   65.154121] [c0a47e70] [c00c0e94] 
cpu_startup_entry+0x34/0x40

[   65.154313] [c0a47ea0] [c000ef1c] rest_init+0x10c/0x124
[   65.154414] [c0a47ee0] [c054] 
start_kernel+0x568/0x594
[   65.154585] [c0a47f90] [c000a7cc] 
start_here_common+0x1c/0x330

[   65.154854] Instruction dump:
[   65.155191] 38210030 e8010010 7c0803a6 4e800020 3d220004 39295228 
8129 3929
[   65.155498] 7d284038 7c0004ac 5508017e 65082800 <7c00411c> e94d0178 
812a 3929

[   65.156155] ---[ end trace 6180d12e268ffdaf ]---
[   65.185452]
[   66.187490] Kernel panic - not syncing: Fatal exception

This is with "qemu-system-ppc64 -smp 4 -machine pseries" on QEMU 4.0.0.

I'm not totally sure what's going on here. I'm emulating a pseries, and 
using that with qemu's pseries model, and I see that selecting the 
pseries forces the selection of 'config PPC_DOORBELL' (twice in the same 
section, actually). Then inside the kernel there appears to be some 
runtime CPU check for doorbell support. Is this a case in which QEMU is 
advertising doorbell support that TCG doesn't have? Or is something else 
happening here?


Thanks,
Jason


Re: [Qemu-ppc] pseries on qemu-system-ppc64le crashes in doorbell_core_ipi()

2019-12-19 Thread Sebastian Andrzej Siewior
On 2019-12-19 11:41:21 [+0100], Jason A. Donenfeld wrote:
> Hi folks,
Hi,

so this should duct tape it:

diff --git a/arch/powerpc/kernel/dbell.c b/arch/powerpc/kernel/dbell.c
index f17ff1200eaae..ec044bdf362a1 100644
--- a/arch/powerpc/kernel/dbell.c
+++ b/arch/powerpc/kernel/dbell.c
@@ -60,16 +60,8 @@ void doorbell_core_ipi(int cpu)
  */
 int doorbell_try_core_ipi(int cpu)
 {
-   int this_cpu = get_cpu();
int ret = 0;
 
-   if (cpumask_test_cpu(cpu, cpu_sibling_mask(this_cpu))) {
-   doorbell_core_ipi(cpu);
-   ret = 1;
-   }
-
-   put_cpu();
-
return ret;
 }
 

> This is with "qemu-system-ppc64 -smp 4 -machine pseries" on QEMU 4.0.0.

Interesting. I didn't get v5.4 to boot a while ago and didn't have the
time to look into it.

> I'm not totally sure what's going on here. I'm emulating a pseries, and
> using that with qemu's pseries model, and I see that selecting the pseries
> forces the selection of 'config PPC_DOORBELL' (twice in the same section,
> actually). Then inside the kernel there appears to be some runtime CPU check
> for doorbell support. Is this a case in which QEMU is advertising doorbell
> support that TCG doesn't have? Or is something else happening here?

My understanding is that the doorbell feature is part of the
architecture. It can be used to signal other siblings on the same CPU.
qemu TCG doesn't support that and does not allow announcing multiple
siblings on the same CPU. However, the kernel uses this interface if it
tries to send an interrupt to itself (the same CPU), so everything
matches.
Last time I ran into this, the interface was changed so the kernel does
not send an IPI to itself. This has now changed for another function…

> Thanks,
> Jason

Sebastian


Re: [Qemu-ppc] pseries on qemu-system-ppc64le crashes in doorbell_core_ipi()

2019-12-19 Thread Jason A. Donenfeld
On Thu, Dec 19, 2019 at 12:13 PM Sebastian Andrzej Siewior
 wrote:
> My understanding is that the doorbell feature is part of the
> architecture. It can be used to signal other siblings on the same CPU.
> qemu TCG doesn't support that and does not allow announcing multiple
> siblings on the same CPU. However, the kernel uses this interface if it
> tries to send an interrupt to itself (the same CPU), so everything
> matches.
> Last time I ran into this, the interface was changed so the kernel does
> not send an IPI to itself. This has now changed for another function…

One way of fixing this is to just "not use the feature", as you seem
to be suggesting.

But actually shouldn't there be some CPU feature detection available?
Something like -- QEMU advertises to the kernel that it supports or
doesn't support doorbells, and the kernel then avoids those paths if
the CPU feature flag isn't present?

Jason
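
For reference, that kind of guard does exist for the IPI path: pseries
only uses doorbells when the CPU feature bit is set (see the
cpu_has_feature(CPU_FTR_DBELL) test in pseries/smp.c quoted elsewhere in
this digest). A hedged sketch of the pattern, with an illustrative
fallback:

static void cause_ipi(int cpu)
{
	/* only try a msgsndp doorbell if the CPU advertises the feature */
	if (cpu_has_feature(CPU_FTR_DBELL) && doorbell_try_core_ipi(cpu))
		return;

	/* otherwise fall back to the interrupt controller (XICS/XIVE);
	 * the exact hook below is illustrative, not the pseries code */
	icp_ops->cause_ipi(cpu);
}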


[PATCH V2] powerpc/setup_64: use DEFINE_DEBUGFS_ATTRIBUTE to define fops_rfi_flush

2019-12-19 Thread Chen Zhou
Use DEFINE_DEBUGFS_ATTRIBUTE rather than DEFINE_SIMPLE_ATTRIBUTE for
debugfs files.

In order to protect against file removal races, debugfs files created via
debugfs_create_file() are wrapped by a struct file_operations at their
opening.

If the original struct file_operations is known to be safe against removal
races already, the proxy creation may be bypassed by creating the files
using DEFINE_DEBUGFS_ATTRIBUTE() and debugfs_create_file_unsafe().

Signed-off-by: Chen Zhou 
---
Changes since v1:
- modify commit message.

 arch/powerpc/kernel/setup_64.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 6104917..4b9fbb2 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -956,11 +956,11 @@ static int rfi_flush_get(void *data, u64 *val)
return 0;
 }
 
-DEFINE_SIMPLE_ATTRIBUTE(fops_rfi_flush, rfi_flush_get, rfi_flush_set, 
"%llu\n");
+DEFINE_DEBUGFS_ATTRIBUTE(fops_rfi_flush, rfi_flush_get, rfi_flush_set, 
"%llu\n");
 
 static __init int rfi_flush_debugfs_init(void)
 {
-   debugfs_create_file("rfi_flush", 0600, powerpc_debugfs_root, NULL, 
&fops_rfi_flush);
+   debugfs_create_file_unsafe("rfi_flush", 0600, powerpc_debugfs_root, 
NULL, &fops_rfi_flush);
return 0;
 }
 device_initcall(rfi_flush_debugfs_init);
-- 
2.7.4



Re: [PATCH v4 4/4] powerpc: Book3S 64-bit "heavyweight" KASAN support

2019-12-19 Thread Christophe Leroy




On 19/12/2019 at 10:50, Daniel Axtens wrote:

Christophe Leroy  writes:


On 12/19/2019 12:36 AM, Daniel Axtens wrote:

KASAN support on Book3S is a bit tricky to get right:

   - It would be good to support inline instrumentation so as to be able to
 catch stack issues that cannot be caught with outline mode.

   - Inline instrumentation requires a fixed offset.

   - Book3S runs code in real mode after booting. Most notably a lot of KVM
 runs in real mode, and it would be good to be able to instrument it.

   - Because code runs in real mode after boot, the offset has to point to
 valid memory both in and out of real mode.

  [ppc64 mm note: The kernel installs a linear mapping at effective
  address c000... onward. This is a one-to-one mapping with physical
  memory from ... onward. Because of how memory accesses work on
  powerpc 64-bit Book3S, a kernel pointer in the linear map accesses the
  same memory both with translations on (accessing as an 'effective
  address'), and with translations off (accessing as a 'real
  address'). This works in both guests and the hypervisor. For more
  details, see s5.7 of Book III of version 3 of the ISA, in particular
  the Storage Control Overview, s5.7.3, and s5.7.5 - noting that this
  KASAN implementation currently only supports Radix.]

One approach is just to give up on inline instrumentation. This way all
checks can be delayed until after everything is set up correctly, and the
address-to-shadow calculations can be overridden. However, the features and
speed boost provided by inline instrumentation are worth trying to do
better.

If _at compile time_ it is known how much contiguous physical memory a
system has, the top 1/8th of the first block of physical memory can be set
aside for the shadow. This is a big hammer and comes with 3 big
consequences:

   - there's no nice way to handle physically discontiguous memory, so only
 the first physical memory block can be used.

   - kernels will simply fail to boot on machines with less memory than
 specified when compiling.

   - kernels running on machines with more memory than specified when
 compiling will simply ignore the extra memory.

Implement and document KASAN this way. The current implementation is Radix
only.

Despite the limitations, it can still find bugs,
e.g. http://patchwork.ozlabs.org/patch/1103775/

At the moment, this physical memory limit must be set _even for outline
mode_. This may be changed in a later series - a different implementation
could be added for outline mode that dynamically allocates shadow at a
fixed offset. For example, see https://patchwork.ozlabs.org/patch/795211/

Suggested-by: Michael Ellerman 
Cc: Balbir Singh  # ppc64 out-of-line radix version
Cc: Christophe Leroy  # ppc32 version
Signed-off-by: Daniel Axtens 

---
Changes since v3:
   - Address further feedback from Christophe.
   - Drop changes to stack walking, it looks like the issue I observed is
 related to that particular stack, not stack-walking generally.

Changes since v2:

   - Address feedback from Christophe around cleanups and docs.
   - Address feedback from Balbir: at this point I don't have a good solution
 for the issues you identify around the limitations of the inline 
implementation
 but I think that it's worth trying to get the stack instrumentation 
support.
 I'm happy to have an alternative and more flexible outline mode - I had
 envisioned this would be called 'lightweight' mode as it imposes fewer 
restrictions.
 I've linked to your implementation. I think it's best to add it in a 
follow-up series.
   - Made the default PHYS_MEM_SIZE_FOR_KASAN value 1024MB. I think most people 
have
 guests with at least that much memory in the Radix 64s case so it's a much
 saner default - it means that if you just turn on KASAN without reading the
 docs you're much more likely to have a bootable kernel, which you will 
never
 have if the value is set to zero! I'm happy to bikeshed the value if we 
want.

Changes since v1:
   - Landed kasan vmalloc support upstream
   - Lots of feedback from Christophe.

Changes since the rfc:

   - Boots real and virtual hardware, kvm works.

   - disabled reporting when we're checking the stack for exception
 frames. The behaviour isn't wrong, just incompatible with KASAN.

   - Documentation!

   - Dropped old module stuff in favour of KASAN_VMALLOC.

The bugs with ftrace and kuap were due to kernel bloat pushing
prom_init calls to be done via the plt. Because we did not have
a relocatable kernel, and they are done very early, this caused
everything to explode. Compile with CONFIG_RELOCATABLE!
---
   Documentation/dev-tools/kasan.rst|   8 +-
   Documentation/powerpc/kasan.txt  | 112 ++-
   arch/powerpc/Kconfig |   2 +
   arch/powerpc/Kconfig.debug   |  21 
   

Re: [PATCH 1/2] powerpc/pseries/svm: Don't access some SPRs

2019-12-19 Thread Michael Ellerman
Sukadev Bhattiprolu  writes:
> Michael Ellerman [m...@ellerman.id.au] wrote:
>> 
>> eg. here.
>> 
>> This is the fast path of context switch.
>> 
>> That expands to:
>> 
>>  if (!(mfmsr() & MSR_S))
>>  asm volatile("mfspr %0, SPRN_BESCR" : "=r" (rval));
>>  if (!(mfmsr() & MSR_S))
>>  asm volatile("mfspr %0, SPRN_EBBHR" : "=r" (rval));
>>  if (!(mfmsr() & MSR_S))
>>  asm volatile("mfspr %0, SPRN_EBBRR" : "=r" (rval));
>> 
>
> Yes, should have optimized this at least :-)
>> 
>> If the Ultravisor is going to disable EBB and BHRB then we need new
>> CPU_FTR bits for those, and the code that accesses those registers
>> needs to be put behind cpu_has_feature(EBB) etc.
>
> Will try the cpu_has_feature(). Would it be OK to use a single feature
> bit, like UV, or make it per-register group, as that could need more
> feature bits?

We already have a number of places using is_secure_guest():

  arch/powerpc/include/asm/mem_encrypt.h: return is_secure_guest();
  arch/powerpc/include/asm/mem_encrypt.h: return is_secure_guest();
  arch/powerpc/include/asm/svm.h:#define get_dtl_cache_ctor() 
(is_secure_guest() ? dtl_cache_ctor : NULL)
  arch/powerpc/kernel/machine_kexec_64.c: if (is_secure_guest() && 
!(image->preserve_context ||
  arch/powerpc/kernel/paca.c: if (is_secure_guest())
  arch/powerpc/kernel/sysfs.c:return sprintf(buf, "%u\n", 
is_secure_guest());
  arch/powerpc/platforms/pseries/iommu.c: if (!is_secure_guest())
  arch/powerpc/platforms/pseries/smp.c:   if (cpu_has_feature(CPU_FTR_DBELL) && 
!is_secure_guest())
  arch/powerpc/platforms/pseries/svm.c:   if (!is_secure_guest())


Which could all (or mostly) be converted to use a cpu_has_feature(CPU_FTR_SVM).

So yeah I guess it makes sense to do that, create a CPU_FTR_SVM and set
it early in boot based on MSR_S.

You could argue it's a firmware feature, so should be FW_FEATURE_SVM,
but we don't use jump_labels for firmware features so they're not as
nice for hot-path code like register switching. Also the distinction
between CPU and firmware features is a bit arbitrary.
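
A hedged sketch of that idea (CPU_FTR_SVM is hypothetical here, as are
the function names):

static __init void svm_cpu_feature_init(void)
{
	/* latch MSR_S into a CPU feature bit once, early in boot */
	if (mfmsr() & MSR_S)
		cur_cpu_spec->cpu_features |= CPU_FTR_SVM;
}

static inline unsigned long read_bescr(void)
{
	/* hot path: a jump-label-backed feature test, not mfmsr() */
	if (cpu_has_feature(CPU_FTR_SVM))
		return 0;
	return mfspr(SPRN_BESCR);
}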

cheers


Re: [PATCH v4 4/4] powerpc: Book3S 64-bit "heavyweight" KASAN support

2019-12-19 Thread Michael Ellerman
Daniel Axtens  writes:
> Christophe Leroy  writes:
>> On 12/19/2019 12:36 AM, Daniel Axtens wrote:
>>> KASAN support on Book3S is a bit tricky to get right:
...
>>> diff --git a/arch/powerpc/include/asm/kasan.h 
>>> b/arch/powerpc/include/asm/kasan.h
>>> index 296e51c2f066..f18268cbdc33 100644
>>> --- a/arch/powerpc/include/asm/kasan.h
>>> +++ b/arch/powerpc/include/asm/kasan.h
>>> @@ -2,6 +2,9 @@
>>>   #ifndef __ASM_KASAN_H
>>>   #define __ASM_KASAN_H
>>>   
>>> +#include 
>>> +#include 
>>
>> What do you need asm/pgtable.h for ?
>>
>> Build failure due to circular inclusion of asm/pgtable.h:
>
> I see there's a lot of ppc32 stuff; I clearly need to bite the bullet
> and get a ppc32 toolchain so I can squash these without chewing up any
> more of your time. I'll sort that out and send a new spin.

I think you run Ubuntu, in which case it should just be:

$ apt install gcc-powerpc-linux-gnu

cheers


Re: [PATCH v15 06/23] selftests/vm/pkeys: Typecast the pkey register

2019-12-19 Thread Michael Ellerman
Dave Hansen  writes:
> On 12/18/19 12:59 PM, Michal Suchánek wrote:
>>> I'd really just rather do %016lx *everywhere* than sprinkle the
>>> PKEY_REG_FMTs around.
>> Does lx work with u32 without warnings?
>
> Either way, I'd be happy to just make the x86 one u64 to make the whole
> thing look more sane,

It's userspace so you don't get u64, you only get __u64.

And then you'll hit the fact that by default __u64 is unsigned long on
powerpc and unsigned long long on x86, meaning you still can't use the
same printf specifier.

To avoid that you should define __SANE_USERSPACE_TYPES__ before
including any headers, and then you'll get unsigned long long for __u64
everywhere and you can just use %llx.
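
A minimal sketch of that in a selftest:

/* must come before any header that pulls in the __uNN types */
#define __SANE_USERSPACE_TYPES__
#include <linux/types.h>
#include <stdio.h>

int main(void)
{
	__u64 pkey_reg = 0x123456789abcdef0ULL;

	/* __u64 is now unsigned long long on both powerpc and x86,
	 * so one format string works everywhere */
	printf("%016llx\n", pkey_reg);
	return 0;
}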

cheers


Re: [PATCH v17 06/23] powerpc: mm: Add p?d_leaf() definitions

2019-12-19 Thread Michael Ellerman
Steven Price  writes:
> walk_page_range() is going to be allowed to walk page tables other than
> those of user space. For this it needs to know when it has reached a
> 'leaf' entry in the page tables. This information is provided by the
> p?d_leaf() functions/macros.
>
> For powerpc p?d_is_leaf() functions already exist. Export them using the
> new p?d_leaf() name.
>
> CC: Benjamin Herrenschmidt 
> CC: Paul Mackerras 
> CC: Michael Ellerman 
> CC: linuxppc-dev@lists.ozlabs.org
> CC: kvm-...@vger.kernel.org
> Signed-off-by: Steven Price 
> ---
>  arch/powerpc/include/asm/book3s/64/pgtable.h | 3 +++
>  1 file changed, 3 insertions(+)

We have fallback versions of our pmd_is_leaf() etc. in
arch/powerpc/include/asm/pgtable.h, eg:

#ifndef pmd_is_leaf
#define pmd_is_leaf pmd_is_leaf
static inline bool pmd_is_leaf(pmd_t pmd)
{
return false;
}
#endif

Because we support several different MMUs and most of them don't need to
do anything.

So we could put the compatibility #defines to your names along with the
fallback versions in asm/pgtable.h, rather than in
asm/book3s/64/pgtable.h
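
i.e. a sketch of that alternative, following the existing fallback
pattern:

#ifndef pmd_is_leaf
#define pmd_is_leaf pmd_is_leaf
#define pmd_leaf pmd_is_leaf	/* generic name maps onto the fallback too */
static inline bool pmd_is_leaf(pmd_t pmd)
{
	return false;
}
#endif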

But I see you also have fallbacks for your versions, so it probably
doesn't matter either way.

So I'm OK with this version, unless you think there's a compelling
reason to do the compatibility #defines in our asm/pgtable.h

Acked-by: Michael Ellerman 

cheers


> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index b01624e5c467..201a69e6a355 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -1355,18 +1355,21 @@ static inline bool is_pte_rw_upgrade(unsigned long 
> old_val, unsigned long new_va
>   * Like pmd_huge() and pmd_large(), but works regardless of config options
>   */
>  #define pmd_is_leaf pmd_is_leaf
> +#define pmd_leaf pmd_is_leaf
>  static inline bool pmd_is_leaf(pmd_t pmd)
>  {
>   return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
>  }
>  
>  #define pud_is_leaf pud_is_leaf
> +#define pud_leaf pud_is_leaf
>  static inline bool pud_is_leaf(pud_t pud)
>  {
>   return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
>  }
>  
>  #define pgd_is_leaf pgd_is_leaf
> +#define pgd_leaf pgd_is_leaf
>  static inline bool pgd_is_leaf(pgd_t pgd)
>  {
>   return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PTE));
> -- 
> 2.20.1


Re: READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

2019-12-19 Thread Will Deacon
On Tue, Dec 17, 2019 at 10:32:35AM -0800, Linus Torvalds wrote:
> On Tue, Dec 17, 2019 at 10:04 AM Linus Torvalds
>  wrote:
> >
> > Let me think about it.
> 
> How about we just get rid of the union entirely, and just use
> 'unsigned long' or 'unsigned long long' depending on the size.
> 
> Something like the attached patch - it still requires that it be an
> arithmetic type, but now because of the final cast.
> 
> But it might still be a cast to a volatile type, of course. Then the
> result will be volatile, but at least now READ_ONCE() won't be taking
> the address of a volatile variable on the stack - does that at least
> fix some of the horrible code generation. Hmm?

Sounds like it according to mpe, but I'll confirm too for arm64.

> This is untested, because I obviously still have the cases of
> structures (page table entries) being accessed once..
> 
>   Linus

>  include/linux/compiler.h | 33 +
>  1 file changed, 17 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index 5e88e7e33abe..8b4282194f16 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -179,18 +179,18 @@ void ftrace_likely_update(struct ftrace_likely_data *f, 
> int val,
>  
>  #include 
>  
> -#define __READ_ONCE_SIZE \
> -({   \
> - switch (size) { \
> - case 1: *(__u8 *)res = *(volatile __u8 *)p; break;  \
> - case 2: *(__u16 *)res = *(volatile __u16 *)p; break;\
> - case 4: *(__u32 *)res = *(volatile __u32 *)p; break;\
> - case 8: *(__u64 *)res = *(volatile __u64 *)p; break;\
> - default:\
> - barrier();  \
> - __builtin_memcpy((void *)res, (const void *)p, size);   \
> - barrier();  \
> - }   \
> +/* "unsigned long" or "unsigned long long" - make it fit in a register if 
> possible */
> +#define __READ_ONCE_TYPE(size) \
> + __typeof__(__builtin_choose_expr(size > sizeof(0UL), 0ULL, 0UL))

Ha, I wondered when '__builtin_choose_expr()' would make an appearance in
this thread! Nice trick. I'll try integrating this with what I have and see
what I run into next.

Back down the rabbit hole...

Will
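
For reference, here is a standalone sketch of the type-selection trick
(it relies on the GCC/Clang __builtin_choose_expr and __typeof__
extensions, and is an illustration rather than the kernel's final macro):

  #include <stdio.h>

  /* Pick a register-sized integer type at compile time: unsigned long
   * if the value fits in one, unsigned long long otherwise. */
  #define REG_TYPE(size) \
          __typeof__(__builtin_choose_expr((size) > sizeof(0UL), 0ULL, 0UL))

  int main(void)
  {
          REG_TYPE(sizeof(int)) a;  /* fits in a long: unsigned long */
          REG_TYPE(16) b;           /* wider than a long: unsigned long long */

          printf("sizeof(a)=%zu sizeof(b)=%zu\n", sizeof(a), sizeof(b));
          return 0;
  }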


Re: [PATCH v4 4/4] powerpc: Book3S 64-bit "heavyweight" KASAN support

2019-12-19 Thread Daniel Axtens
Christophe Leroy  writes:

On 19/12/2019 at 10:50, Daniel Axtens wrote:
>> Christophe Leroy  writes:
>> 
>>> On 12/19/2019 12:36 AM, Daniel Axtens wrote:
 KASAN support on Book3S is a bit tricky to get right:

- It would be good to support inline instrumentation so as to be able to
  catch stack issues that cannot be caught with outline mode.

- Inline instrumentation requires a fixed offset.

- Book3S runs code in real mode after booting. Most notably a lot of KVM
  runs in real mode, and it would be good to be able to instrument it.

- Because code runs in real mode after boot, the offset has to point to
  valid memory both in and out of real mode.

   [ppc64 mm note: The kernel installs a linear mapping at effective
   address c000... onward. This is a one-to-one mapping with physical
   memory from 0000... onward. Because of how memory accesses work on
   powerpc 64-bit Book3S, a kernel pointer in the linear map accesses the
   same memory both with translations on (accessing as an 'effective
   address'), and with translations off (accessing as a 'real
   address'). This works in both guests and the hypervisor. For more
   details, see s5.7 of Book III of version 3 of the ISA, in particular
   the Storage Control Overview, s5.7.3, and s5.7.5 - noting that this
   KASAN implementation currently only supports Radix.]

 One approach is just to give up on inline instrumentation. This way all
 checks can be delayed until after everything is set up correctly, and the
 address-to-shadow calculations can be overridden. However, the features and
 speed boost provided by inline instrumentation are worth trying to do
 better.

 If _at compile time_ it is known how much contiguous physical memory a
 system has, the top 1/8th of the first block of physical memory can be set
 aside for the shadow. This is a big hammer and comes with 3 big
 consequences:

- there's no nice way to handle physically discontiguous memory, so only
  the first physical memory block can be used.

- kernels will simply fail to boot on machines with less memory than
  specified when compiling.

- kernels running on machines with more memory than specified when
  compiling will simply ignore the extra memory.
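
 To make the address arithmetic concrete: with a compile-time constant
 offset, the generic KASAN memory-to-shadow translation reduces to a
 shift and an add (sketch below; the offset value is illustrative, not
 the one this series picks):

   /* Each 8 bytes of memory map to 1 shadow byte at a fixed offset;
    * the offset value here is an example only. */
   #define KASAN_SHADOW_SCALE_SHIFT  3
   #define KASAN_SHADOW_OFFSET       0xc00e000000000000UL  /* illustrative */

   static inline void *kasan_mem_to_shadow(const void *addr)
   {
           return (void *)((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT)
                   + KASAN_SHADOW_OFFSET;
   }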

 Implement and document KASAN this way. The current implementation is Radix
 only.

 Despite the limitations, it can still find bugs,
 e.g. http://patchwork.ozlabs.org/patch/1103775/

 At the moment, this physical memory limit must be set _even for outline
 mode_. This may be changed in a later series - a different implementation
 could be added for outline mode that dynamically allocates shadow at a
 fixed offset. For example, see https://patchwork.ozlabs.org/patch/795211/

 Suggested-by: Michael Ellerman 
 Cc: Balbir Singh  # ppc64 out-of-line radix version
 Cc: Christophe Leroy  # ppc32 version
 Signed-off-by: Daniel Axtens 

 ---
 Changes since v3:
- Address further feedback from Christophe.
- Drop changes to stack walking; it looks like the issue I observed is
  related to that particular stack, not stack-walking generally.

 Changes since v2:

- Address feedback from Christophe around cleanups and docs.
- Address feedback from Balbir: at this point I don't have a good solution
  for the issues you identify around the limitations of the inline
  implementation, but I think that it's worth trying to get the stack
  instrumentation support. I'm happy to have an alternative and more
  flexible outline mode - I had envisioned this would be called
  'lightweight' mode as it imposes fewer restrictions. I've linked to your
  implementation. I think it's best to add it in a follow-up series.
- Made the default PHYS_MEM_SIZE_FOR_KASAN value 1024MB. I think most
  people have guests with at least that much memory in the Radix 64s case,
  so it's a much saner default - it means that if you just turn on KASAN
  without reading the docs you're much more likely to have a bootable
  kernel, which you will never have if the value is set to zero! I'm happy
  to bikeshed the value if we want.

 Changes since v1:
- Landed kasan vmalloc support upstream
- Lots of feedback from Christophe.

 Changes since the rfc:

- Boots real and virtual hardware, kvm works.

- disabled reporting when we're checking the stack for exception
  frames. The behaviour isn't wrong, just incompatible with KASAN.

- Documentation!

- Dropped old module