Re: [PATCH] powerpc/eeh: Handle hugepages in ioremap space

2019-07-14 Thread Michael Ellerman
On Wed, 2019-07-10 at 15:05:17 UTC, Oliver O'Halloran wrote:
> In commit 4a7b06c157a2 ("powerpc/eeh: Handle hugepages in ioremap
> space") support for using hugepages in the vmalloc and ioremap areas was
> enabled for radix. Unfortunately this broke EEH MMIO error checking.
> 
> Detection works by inserting a hook which checks the results of the
> ioreadXX() set of functions.  When a read returns a 0xFFs response we
> need to check for an error which we do by mapping the (virtual) MMIO
> address back to a physical address, then mapping physical address to a
> PCI device via an interval tree.
> 
> When translating virt -> phys we currently assume the ioremap space is
> only populated by PAGE_SIZE mappings. If a hugepage mapping is found we
> emit a WARN_ON(), but otherwise handles the check as though a normal
> page was found. In pathalogical cases such as copying a buffer
> containing a lot of 0xFFs from BAR memory this can result in the system
> not booting because it's too busy printing WARN_ON()s.
> 
> There's no real reason to assume huge pages can't be present and we're
> prefectly capable of handling them, so do that.
> 
> Cc: Nicholas Piggin 
> Fixes: 4a7b06c157a2 ("powerpc/eeh: Handle hugepages in ioremap space")
> Reported-by: Sachin Sant 
> Signed-off-by: Oliver O'Halloran 
> Tested-by: Sachin Sant 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/33439620680be5225c1b8806579a291e0d761ca0

cheers


Re: [PATCH] powerpc/powernv/idle: Fix restore of SPRN_LDBAR for POWER9 stop state.

2019-07-14 Thread Michael Ellerman
On Tue, 2019-07-02 at 10:58:36 UTC, Madhavan Srinivasan wrote:
> From: Athira Rajeev 
> 
> commit 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")
> reimplemented book3S code to pltform/powernv/idle.c. But when doing so
> missed to add the per-thread LDBAR update in the core_woken path of
> the power9_idle_stop(). Patch fixes the same.
> 
> Fixes: 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")
> Signed-off-by: Athira Rajeev 
> Signed-off-by: Madhavan Srinivasan 
> Reviewed-by: Nicholas Piggin 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/f5a9e488d62360c91c5770bd55a0b40e419a71ce

cheers


Re: [PATCH v4 3/8] KVM: PPC: Ultravisor: Add generic ultravisor call handler

2019-07-14 Thread Michael Ellerman
Claudio Carvalho  writes:
> On 7/11/19 9:57 AM, Michael Ellerman wrote:
>> Claudio Carvalho  writes:
>>> From: Ram Pai 
>>>
>>> Add the ucall() function, which can be used to make ultravisor calls
>>> with varied number of in and out arguments. Ultravisor calls can be made
>>> from the host or guests.
>>>
>>> This copies the implementation of plpar_hcall().
>> .. with quite a few changes?
>>
>> This is one of the things I'd like to see in a Documentation file, so
>> that people can review the implementation vs the specification.
>
> I will document this (and other things) in a file under Documentation/powerpc.

Thanks.

>>> +/* API functions */
>>> +#define UCALL_BUFSIZE 4
>> Please don't copy this design from the hcall code, it has led to bugs in
>> the past.
>>
>> See my (still unmerged) attempt to fix this for the hcall case:
>>   https://patchwork.ozlabs.org/patch/683577/
>>
>> Basically instead of asking callers nicely to define a certain sized
>> buffer, and them forgetting, define a proper type that has the right size.
>
> I will keep that in mind. For now I think we don't need that since the v5
> will have ucall_norets() instead.

OK.

>>> diff --git a/arch/powerpc/kernel/ultravisor.c 
>>> b/arch/powerpc/kernel/ultravisor.c
>>> index dc6021f63c97..02ddf79a9522 100644
>>> --- a/arch/powerpc/kernel/ultravisor.c
>>> +++ b/arch/powerpc/kernel/ultravisor.c
>>> @@ -8,10 +8,14 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  
>>>  #include 
>>>  #include 
>>>  
>>> +/* in ucall.S */
>>> +EXPORT_SYMBOL_GPL(ucall);
>> This should be in ucall.S
>
> Here I'm following the same way that hypercall wrapper symbols are exported.

Yeah we used to not be able to export from .S but that was fixed a few
years ago.

> Last time I tried to export that in ucall.S the linker complained that the
> ucall
> symbol will not be versioned. Something like this:
> https://patchwork.kernel.org/patch/9452759/

I think you just need to include  in the .S ?

cheers


Re: [PATCH v4 2/8] powerpc: Introduce FW_FEATURE_ULTRAVISOR

2019-07-14 Thread Michael Ellerman
Claudio Carvalho  writes:
> On 7/11/19 9:57 AM, Michael Ellerman wrote:
>> Claudio Carvalho  writes:
>>> diff --git a/arch/powerpc/include/asm/ultravisor.h 
>>> b/arch/powerpc/include/asm/ultravisor.h
>>> new file mode 100644
>>> index ..e5009b0d84ea
>>> --- /dev/null
>>> +++ b/arch/powerpc/include/asm/ultravisor.h
>>> @@ -0,0 +1,15 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +/*
>>> + * Ultravisor definitions
>>> + *
>>> + * Copyright 2019, IBM Corporation.
>>> + *
>>> + */
>>> +#ifndef _ASM_POWERPC_ULTRAVISOR_H
>>> +#define _ASM_POWERPC_ULTRAVISOR_H
>>> +
>>> +/* Internal functions */
>>> +extern int early_init_dt_scan_ultravisor(unsigned long node, const char 
>>> *uname,
>>> +int depth, void *data);
>> Please don't use extern in new headers.
>>
>>> diff --git a/arch/powerpc/kernel/ultravisor.c 
>>> b/arch/powerpc/kernel/ultravisor.c
>>> new file mode 100644
>>> index ..dc6021f63c97
>>> --- /dev/null
>>> +++ b/arch/powerpc/kernel/ultravisor.c
>> Is there a reason this (and other later files) aren't in platforms/powernv ?
>
> Yes, there is.
> https://www.spinics.net/lists/kvm-ppc/msg14998.html
>
> We also need to do ucalls from a secure guest and its kernel may not
> have CONFIG_PPC_POWERNV=y. I can make it clear in the commit message.

OK, sorry I missed Paul's comment. Yeah that is obviously a valid reason :)

>>> @@ -0,0 +1,24 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +/*
>>> + * Ultravisor high level interfaces
>>> + *
>>> + * Copyright 2019, IBM Corporation.
>>> + *
>>> + */
>>> +#include 
>>> +#include 
>>> +#include 
>>> +
>>> +#include 
>>> +#include 
>>> +
>>> +int __init early_init_dt_scan_ultravisor(unsigned long node, const char 
>>> *uname,
>>> +int depth, void *data)
>>> +{
>>> +   if (depth != 1 || strcmp(uname, "ibm,ultravisor") != 0)
>>> +   return 0;
>> I know you're following the example of OPAL, but this is not the best
>> way to search for the ultravisor node.
>>
>> It makes the location and name of the node part of the ABI, when there's
>> no need for it to be.
>>
>> If instead you just scan the tree for a node that is *compatible* with
>> "ibm,ultravisor" (or whatever compatible string) then the node can be
>> placed any where in the tree and have any name, which gives us the most
>> flexibility in future to change the location of the device tree node.
>
> I will do that in the next version.

Excellent, thanks.

cheers


Re: [RFC PATCH kernel] powerpc/xive: Drop deregistered irqs

2019-07-14 Thread Alexey Kardashevskiy




On 15/07/2019 05:14, Daniel Henrique Barboza wrote:

This patch fixed an issue I was experiencing with virsh start/destroy
of guests with mlx5 and GPU passthrough in a Power 9 server. I
believe it's a similar situation which Alexey described in the post
commit msg.


Tested-by: Daniel Henrique Barboza 



Thanks for testing! Unfortunately, this is not the right fix as it only 
reduces the race window but does not eliminate it, we need something 
better than this, i.e. do not backport this anywhere please :) Thanks,






On 7/12/19 5:20 AM, Alexey Kardashevskiy wrote:

There is a race between releasing an irq on one cpu and fetching it
from XIVE on another cpu as there does not seem to be any locking between
these, probably because xive_irq_chip::irq_shutdown() is supposed to
remove the irq from all queues in the system which it does not do.

As a result, when such released irq appears in a queue, we take it
from the queue but we do not change the current priority on that cpu and
since there is no handler for the irq, EOI is never called and the cpu
current priority remains elevated (7 vs. 0xff==unmasked). If another irq
is assigned to the same cpu, then that device stops working until irq
is moved to another cpu or the device is reset.

This checks if irq is still registered, if not, it assumes no valid irq
was fetched from the loop and if there is none left, it continues to
the irq==0 case (not visible in this patch) and sets priority to 0xff
which is basically unmasking. This calls irq_to_desc() on a hot path now
which is a radix tree lookup; hopefully this won't be noticeable as
that tree is quite small.

Signed-off-by: Alexey Kardashevskiy
---

Found it on P9 system with:
- a host with 8 cpus online
- a boot disk on ahci with its msix on cpu#0
- a guest with 2xGPUs + 6xNVLink + 4 cpus
- GPU#0 from the guest is bound to the same cpu#0.

Killing a guest killed ahci and therefore the host because of the race.
Note that VFIO masks interrupts first and only then resets the device.

Alternatives:

1. Fix xive_irq_chip::irq_shutdown() to walk through all cpu queues and
drop deregistered irqs.

2. Exploit chip->irq_get_irqchip_state function from
62e0468650c30f0298 "genirq: Add optional hardware synchronization for shutdown".

Both require deep XIVE knowledge which I do not have.
---
  arch/powerpc/sysdev/xive/common.c | 8 ++--
  1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 082c7e1c20f0..65742e280337 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -148,8 +148,12 @@ static u32 xive_scan_interrupts(struct xive_cpu *xc, bool 
just_peek)
irq = xive_read_eq(>queue[prio], just_peek);
  
  		/* Found something ? That's it */

-   if (irq)
-   break;
+   if (irq) {
+   /* Another CPU may have shut this irq down, check it */
+   if (irq_to_desc(irq))
+   break;
+   irq = 0;
+   }
  
  		/* Clear pending bits */

xc->pending_prio &= ~(1 << prio);




--
Alexey


Re: [PATCH 1/3] KVM: PPC: Book3S HV: Always save guest pmu for guest capable of nesting

2019-07-14 Thread Michael Ellerman
Suraj Jitindar Singh  writes:
> On Sat, 2019-07-13 at 13:47 +1000, Michael Ellerman wrote:
>> Suraj Jitindar Singh  writes:
...
>> > 
>> > Fixes: 95a6432ce903 "KVM: PPC: Book3S HV: Streamlined guest
>> > entry/exit path on P9 for radix guests"
>> 
>> I'm not clear why this and the next commit are marked as fixing the
>> above commit. Wasn't it broken prior to that commit as well?
>
> That was the commit which introduced the entry path which we use for a
> nested guest, the path on which we need to be saving and restoring the
> pmu registers and so where the new code was introduced.

OK, I thought that commit was an unrelated optimisation. Agree that is a
good target if it is the commit that introduced the nested path.

cheers


Re: [PATCH] powerpc: mm: Limit rma_size to 1TB when running without HV mode

2019-07-14 Thread Michael Ellerman
Suraj Jitindar Singh  writes:
> On Fri, 2019-07-12 at 23:09 +1000, Michael Ellerman wrote:
>> Suraj Jitindar Singh  writes:
>> > The virtual real mode addressing (VRMA) mechanism is used when a
>> > partition is using HPT (Hash Page Table) translation and performs
>> > real mode accesses (MSR[IR|DR] = 0) in non-hypervisor mode. In this
>> > mode effective address bits 0:23 are treated as zero (i.e. the
>> > access
>> > is aliased to 0) and the access is performed using an implicit 1TB
>> > SLB
>> > entry.
>> > 
>> > The size of the RMA (Real Memory Area) is communicated to the guest
>> > as
>> > the size of the first memory region in the device tree. And because
>> > of
>> > the mechanism described above can be expected to not exceed 1TB. In
>> > the
>> > event that the host erroneously represents the RMA as being larger
>> > than
>> > 1TB, guest accesses in real mode to memory addresses above 1TB will
>> > be
>> > aliased down to below 1TB. This means that a memory access
>> > performed in
>> > real mode may differ to one performed in virtual mode for the same
>> > memory
>> > address, which would likely have unintended consequences.
>> > 
>> > To avoid this outcome have the guest explicitly limit the size of
>> > the
>> > RMA to the current maximum, which is 1TB. This means that even if
>> > the
>> > first memory block is larger than 1TB, only the first 1TB should be
>> > accessed in real mode.
>> > 
>> > Signed-off-by: Suraj Jitindar Singh 
>> 
>> I added:
>> 
>> Fixes: c3ab300ea555 ("powerpc: Add POWER9 cputable entry")
>> Cc: sta...@vger.kernel.org # v4.6+
>> 
>> 
>> Which is not exactly correct, but probably good enough?
>
> I think we actually want:
> Fixes: c610d65c0ad0 ("powerpc/pseries: lift RTAS limit for hash")
>
> Which is what actually caused it to break and for the issue to present
> itself.

Thanks, I used that instead.

cheers


Re: [PATCH 1/3] KVM: PPC: Book3S HV: Always save guest pmu for guest capable of nesting

2019-07-14 Thread Suraj Jitindar Singh
On Sat, 2019-07-13 at 13:47 +1000, Michael Ellerman wrote:
> Suraj Jitindar Singh  writes:
> > The performance monitoring unit (PMU) registers are saved on guest
> > exit
> > when the guest has set the pmcregs_in_use flag in its lppaca, if it
> > exists, or unconditionally if it doesn't. If a nested guest is
> > being
> > run then the hypervisor doesn't, and in most cases can't, know if
> > the
> > pmu registers are in use since it doesn't know the location of the
> > lppaca
> > for the nested guest, although it may have one for its immediate
> > guest.
> > This results in the values of these registers being lost across
> > nested
> > guest entry and exit in the case where the nested guest was making
> > use
> > of the performance monitoring facility while it's nested guest
> > hypervisor
> > wasn't.
> > 
> > Further more the hypervisor could interrupt a guest hypervisor
> > between
> > when it has loaded up the pmu registers and it calling
> > H_ENTER_NESTED or
> > between returning from the nested guest to the guest hypervisor and
> > the
> > guest hypervisor reading the pmu registers, in
> > kvmhv_p9_guest_entry().
> > This means that it isn't sufficient to just save the pmu registers
> > when
> > entering or exiting a nested guest, but that it is necessary to
> > always
> > save the pmu registers whenever a guest is capable of running
> > nested guests
> > to ensure the register values aren't lost in the context switch.
> > 
> > Ensure the pmu register values are preserved by always saving their
> > value into the vcpu struct when a guest is capable of running
> > nested
> > guests.
> > 
> > This should have minimal performance impact however any impact can
> > be
> > avoided by booting a guest with "-machine pseries,cap-nested-
> > hv=false"
> > on the qemu commandline.
> > 
> > Fixes: 95a6432ce903 "KVM: PPC: Book3S HV: Streamlined guest
> > entry/exit path on P9 for radix guests"
> 
> I'm not clear why this and the next commit are marked as fixing the
> above commit. Wasn't it broken prior to that commit as well?

That was the commit which introduced the entry path which we use for a
nested guest, the path on which we need to be saving and restoring the
pmu registers and so where the new code was introduced.

It wasn't technically broken prior to that commit since you couldn't
run nested prior to that commit, and in fact it's a few commits after
that one where we actually enabled the ability to run nested guests.

However since that's the code which introduced the nested entry path it
seemed like the best fit for the fixes tag for people who will be
looking for fixes in that area. Also all the other nested entry path
fixes used that fixes tag so it ties them together nicely.

Thanks,
Suraj

> 
> cheers
> 
> > Signed-off-by: Suraj Jitindar Singh 
> > ---
> >  arch/powerpc/kvm/book3s_hv.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/arch/powerpc/kvm/book3s_hv.c
> > b/arch/powerpc/kvm/book3s_hv.c
> > index ec1804f822af..b682a429f3ef 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -3654,6 +3654,8 @@ int kvmhv_p9_guest_entry(struct kvm_vcpu
> > *vcpu, u64 time_limit,
> > vcpu->arch.vpa.dirty = 1;
> > save_pmu = lp->pmcregs_in_use;
> > }
> > +   /* Must save pmu if this guest is capable of running
> > nested guests */
> > +   save_pmu |= nesting_enabled(vcpu->kvm);
> >  
> > kvmhv_save_guest_pmu(vcpu, save_pmu);
> >  
> > -- 
> > 2.13.6


Re: [PATCH] powerpc: mm: Limit rma_size to 1TB when running without HV mode

2019-07-14 Thread Suraj Jitindar Singh
On Fri, 2019-07-12 at 23:09 +1000, Michael Ellerman wrote:
> Suraj Jitindar Singh  writes:
> > The virtual real mode addressing (VRMA) mechanism is used when a
> > partition is using HPT (Hash Page Table) translation and performs
> > real mode accesses (MSR[IR|DR] = 0) in non-hypervisor mode. In this
> > mode effective address bits 0:23 are treated as zero (i.e. the
> > access
> > is aliased to 0) and the access is performed using an implicit 1TB
> > SLB
> > entry.
> > 
> > The size of the RMA (Real Memory Area) is communicated to the guest
> > as
> > the size of the first memory region in the device tree. And because
> > of
> > the mechanism described above can be expected to not exceed 1TB. In
> > the
> > event that the host erroneously represents the RMA as being larger
> > than
> > 1TB, guest accesses in real mode to memory addresses above 1TB will
> > be
> > aliased down to below 1TB. This means that a memory access
> > performed in
> > real mode may differ to one performed in virtual mode for the same
> > memory
> > address, which would likely have unintended consequences.
> > 
> > To avoid this outcome have the guest explicitly limit the size of
> > the
> > RMA to the current maximum, which is 1TB. This means that even if
> > the
> > first memory block is larger than 1TB, only the first 1TB should be
> > accessed in real mode.
> > 
> > Signed-off-by: Suraj Jitindar Singh 
> 
> I added:
> 
> Fixes: c3ab300ea555 ("powerpc: Add POWER9 cputable entry")
> Cc: sta...@vger.kernel.org # v4.6+
> 
> 
> Which is not exactly correct, but probably good enough?

I think we actually want:
Fixes: c610d65c0ad0 ("powerpc/pseries: lift RTAS limit for hash")

Which is what actually caused it to break and for the issue to present
itself.

> 
> cheers
> 
> > diff --git a/arch/powerpc/mm/book3s64/hash_utils.c
> > b/arch/powerpc/mm/book3s64/hash_utils.c
> > index 28ced26f2a00..4d0e2cce9cd5 100644
> > --- a/arch/powerpc/mm/book3s64/hash_utils.c
> > +++ b/arch/powerpc/mm/book3s64/hash_utils.c
> > @@ -1901,11 +1901,19 @@ void
> > hash__setup_initial_memory_limit(phys_addr_t first_memblock_base,
> >  *
> >  * For guests on platforms before POWER9, we clamp the it
> > limit to 1G
> >  * to avoid some funky things such as RTAS bugs etc...
> > +* On POWER9 we limit to 1TB in case the host erroneously
> > told us that
> > +* the RMA was >1TB. Effective address bits 0:23 are
> > treated as zero
> > +* (meaning the access is aliased to zero i.e. addr = addr
> > % 1TB)
> > +* for virtual real mode addressing and so it doesn't make
> > sense to
> > +* have an area larger than 1TB as it can't be addressed.
> >  */
> > if (!early_cpu_has_feature(CPU_FTR_HVMODE)) {
> > ppc64_rma_size = first_memblock_size;
> > if (!early_cpu_has_feature(CPU_FTR_ARCH_300))
> > ppc64_rma_size = min_t(u64,
> > ppc64_rma_size, 0x4000);
> > +   else
> > +   ppc64_rma_size = min_t(u64,
> > ppc64_rma_size,
> > +  1UL <<
> > SID_SHIFT_1T);
> >  
> > /* Finally limit subsequent allocations */
> > memblock_set_current_limit(ppc64_rma_size);
> > -- 
> > 2.13.6


Re: [PATCH v4 6/8] KVM: PPC: Ultravisor: Restrict LDBAR access

2019-07-14 Thread Claudio Carvalho


On 7/11/19 9:57 AM, Michael Ellerman wrote:
> Claudio Carvalho  writes:
>> When the ultravisor firmware is available, it takes control over the
>> LDBAR register. In this case, thread-imc updates and save/restore
>> operations on the LDBAR register are handled by ultravisor.
> Please roll up the replies to Alexey's question about LDBAR into the
> change log.
>
>> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
>> b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> index f9b2620fbecd..cffb365d9d02 100644
>> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> @@ -375,8 +375,10 @@ BEGIN_FTR_SECTION
>>  mtspr   SPRN_RPR, r0
>>  ld  r0, KVM_SPLIT_PMMAR(r6)
>>  mtspr   SPRN_PMMAR, r0
>> +BEGIN_FW_FTR_SECTION_NESTED(70)
>>  ld  r0, KVM_SPLIT_LDBAR(r6)
>>  mtspr   SPRN_LDBAR, r0
>> +END_FW_FTR_SECTION_NESTED(FW_FEATURE_ULTRAVISOR, 0, 70)
> That's in Power8 code isn't it? Which will never have an ultravisor.

IIUC, it might be executed in Power9 as well, but I can double check that.


>
>> diff --git a/arch/powerpc/platforms/powernv/opal-imc.c 
>> b/arch/powerpc/platforms/powernv/opal-imc.c
>> index 1b6932890a73..5fe2d4526cbc 100644
>> --- a/arch/powerpc/platforms/powernv/opal-imc.c
>> +++ b/arch/powerpc/platforms/powernv/opal-imc.c
>> @@ -254,6 +254,10 @@ static int opal_imc_counters_probe(struct 
>> platform_device *pdev)
>>  bool core_imc_reg = false, thread_imc_reg = false;
>>  u32 type;
>>  
>> +/* Disable IMC devices, when Ultravisor is enabled. */
>> +if (firmware_has_feature(FW_FEATURE_ULTRAVISOR))
>> +return -EACCES;
> I don't mind taking this change. But at the same time should the IMC
> stuff just be omitted from the device tree when we're in ultravisor mode?

Yes. Maddy said that he will patch skiboot to remove the IMC nodes if
ultravisor is present.

I added this check just to protect the kernel in case skiboot is not in the
right level for some
reason.

Thanks,
Claudio



>
> cheers
>



Re: [RFC PATCH kernel] powerpc/xive: Drop deregistered irqs

2019-07-14 Thread Benjamin Herrenschmidt
On Sun, 2019-07-14 at 21:44 +0200, Cédric Le Goater wrote:
> > Well, best is probably to do just that though, but call it something
> > like ppc_md.orphan_irq() or something like that instead. Another option
> > as you mention is to try to scrub queues, but that's trickier to do due
> > to the lockless nature of the queue handling.
> 
> When the IRQ is shutdown, couldn't we cleanup the CPU EQ by filtering 
> all the dangling entries, and replacing them with zeroes ? That would
> be alternative 1, but I don't think we need to scan all cpus. The last
> target should be enough.

It's a bit tricky due to the lockless nature of the queues...

Cheers,
Ben.




[PATCH] powerpc/kmcent2: update the ethernet devices' phy properties

2019-07-14 Thread Valentin Longchamp
Change all phy-connection-type properties to phy-mode that are better
supported by the fman driver.

Use the more readable fixed-link node for the 2 sgmii links.

Change the RGMII link to rgmii-id as the clock delays are added by the
phy.

Signed-off-by: Valentin Longchamp 
---
 arch/powerpc/boot/dts/fsl/kmcent2.dts | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/kmcent2.dts 
b/arch/powerpc/boot/dts/fsl/kmcent2.dts
index 48b7f9797124..c3e0741cafb1 100644
--- a/arch/powerpc/boot/dts/fsl/kmcent2.dts
+++ b/arch/powerpc/boot/dts/fsl/kmcent2.dts
@@ -210,13 +210,19 @@
 
fman@40 {
ethernet@e {
-   fixed-link = <0 1 1000 0 0>;
-   phy-connection-type = "sgmii";
+   phy-mode = "sgmii";
+   fixed-link {
+   speed = <1000>;
+   full-duplex;
+   };
};
 
ethernet@e2000 {
-   fixed-link = <1 1 1000 0 0>;
-   phy-connection-type = "sgmii";
+   phy-mode = "sgmii";
+   fixed-link {
+   speed = <1000>;
+   full-duplex;
+   };
};
 
ethernet@e4000 {
@@ -229,7 +235,7 @@
 
ethernet@e8000 {
phy-handle = <_phy>;
-   phy-connection-type = "rgmii";
+   phy-mode = "rgmii-id";
};
 
mdio0: mdio@fc000 {
-- 
2.17.1



Re: [RFC PATCH kernel] powerpc/xive: Drop deregistered irqs

2019-07-14 Thread Daniel Henrique Barboza

This patch fixed an issue I was experiencing with virsh start/destroy
of guests with mlx5 and GPU passthrough in a Power 9 server. I
believe it's a similar situation which Alexey described in the post
commit msg.


Tested-by: Daniel Henrique Barboza 


On 7/12/19 5:20 AM, Alexey Kardashevskiy wrote:

There is a race between releasing an irq on one cpu and fetching it
from XIVE on another cpu as there does not seem to be any locking between
these, probably because xive_irq_chip::irq_shutdown() is supposed to
remove the irq from all queues in the system which it does not do.

As a result, when such released irq appears in a queue, we take it
from the queue but we do not change the current priority on that cpu and
since there is no handler for the irq, EOI is never called and the cpu
current priority remains elevated (7 vs. 0xff==unmasked). If another irq
is assigned to the same cpu, then that device stops working until irq
is moved to another cpu or the device is reset.

This checks if irq is still registered, if not, it assumes no valid irq
was fetched from the loop and if there is none left, it continues to
the irq==0 case (not visible in this patch) and sets priority to 0xff
which is basically unmasking. This calls irq_to_desc() on a hot path now
which is a radix tree lookup; hopefully this won't be noticeable as
that tree is quite small.

Signed-off-by: Alexey Kardashevskiy 
---

Found it on P9 system with:
- a host with 8 cpus online
- a boot disk on ahci with its msix on cpu#0
- a guest with 2xGPUs + 6xNVLink + 4 cpus
- GPU#0 from the guest is bound to the same cpu#0.

Killing a guest killed ahci and therefore the host because of the race.
Note that VFIO masks interrupts first and only then resets the device.

Alternatives:

1. Fix xive_irq_chip::irq_shutdown() to walk through all cpu queues and
drop deregistered irqs.

2. Exploit chip->irq_get_irqchip_state function from
62e0468650c30f0298 "genirq: Add optional hardware synchronization for shutdown".

Both require deep XIVE knowledge which I do not have.
---
  arch/powerpc/sysdev/xive/common.c | 8 ++--
  1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 082c7e1c20f0..65742e280337 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -148,8 +148,12 @@ static u32 xive_scan_interrupts(struct xive_cpu *xc, bool 
just_peek)
irq = xive_read_eq(>queue[prio], just_peek);
  
  		/* Found something ? That's it */

-   if (irq)
-   break;
+   if (irq) {
+   /* Another CPU may have shut this irq down, check it */
+   if (irq_to_desc(irq))
+   break;
+   irq = 0;
+   }
  
  		/* Clear pending bits */

xc->pending_prio &= ~(1 << prio);




Re: [RFC PATCH kernel] powerpc/xive: Drop deregistered irqs

2019-07-14 Thread Cédric Le Goater
On 14/07/2019 03:31, Benjamin Herrenschmidt wrote:
> On Sat, 2019-07-13 at 18:53 +1000, Alexey Kardashevskiy wrote:
>>
>> On 13/07/2019 09:47, Benjamin Herrenschmidt wrote:
>>> On Fri, 2019-07-12 at 19:37 +1000, Alexey Kardashevskiy wrote:

>>>
>>>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/kernel/irq.c#n614

 If so, then in order to do EOI, I'll need the desc which is gone,
 or
 I am missing the point?
>>>
>>> All you need is drop the local CPU priority.
>>
>> I know that, the question was how. I cannot use irq_chip in
>> arch/powerpc/kernel/irq.c and I do not want to add
>> ppc_md.enable_irqs().
> 
> Well, best is probably to do just that though, but call it something
> like ppc_md.orphan_irq() or something like that instead. Another option
> as you mention is to try to scrub queues, but that's trickier to do due
> to the lockless nature of the queue handling.

When the IRQ is shutdown, couldn't we cleanup the CPU EQ by filtering 
all the dangling entries, and replacing them with zeroes ? That would
be alternative 1, but I don't think we need to scan all cpus. The last
target should be enough.

C.   



[PATCH 1/2] arch: mark syscall number 435 reserved for clone3

2019-07-14 Thread Christian Brauner
A while ago Arnd made it possible to give new system calls the same
syscall number on all architectures (except alpha). To not break this
nice new feature let's mark 435 for clone3 as reserved on all
architectures that do not yet implement it.
Even if an architecture does not plan to implement it this ensures that
new system calls coming after clone3 will have the same number on all
architectures.

Signed-off-by: Christian Brauner 
Cc: Arnd Bergmann 
Cc: linux-a...@vger.kernel.org
Cc: linux-al...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux-i...@vger.kernel.org
Cc: linux-m...@lists.linux-m68k.org
Cc: linux-m...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
---
 arch/alpha/kernel/syscalls/syscall.tbl| 1 +
 arch/ia64/kernel/syscalls/syscall.tbl | 1 +
 arch/m68k/kernel/syscalls/syscall.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
 arch/parisc/kernel/syscalls/syscall.tbl   | 1 +
 arch/powerpc/kernel/syscalls/syscall.tbl  | 1 +
 arch/s390/kernel/syscalls/syscall.tbl | 1 +
 arch/sh/kernel/syscalls/syscall.tbl   | 1 +
 arch/sparc/kernel/syscalls/syscall.tbl| 1 +
 11 files changed, 11 insertions(+)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl 
b/arch/alpha/kernel/syscalls/syscall.tbl
index 1db9bbcfb84e..728fe028c02c 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -474,3 +474,4 @@
 542common  fsmount sys_fsmount
 543common  fspick  sys_fspick
 544common  pidfd_open  sys_pidfd_open
+# 545 reserved for clone3
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl 
b/arch/ia64/kernel/syscalls/syscall.tbl
index ecc44926737b..36d5faf4c86c 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -355,3 +355,4 @@
 432common  fsmount sys_fsmount
 433common  fspick  sys_fspick
 434common  pidfd_open  sys_pidfd_open
+# 435 reserved for clone3
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl 
b/arch/m68k/kernel/syscalls/syscall.tbl
index 9a3eb2558568..a88a285a0e5f 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -434,3 +434,4 @@
 432common  fsmount sys_fsmount
 433common  fspick  sys_fspick
 434common  pidfd_open  sys_pidfd_open
+# 435 reserved for clone3
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl 
b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 97035e19ad03..c9c879ec9b6d 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -373,3 +373,4 @@
 432n32 fsmount sys_fsmount
 433n32 fspick  sys_fspick
 434n32 pidfd_open  sys_pidfd_open
+# 435 reserved for clone3
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl 
b/arch/mips/kernel/syscalls/syscall_n64.tbl
index d7292722d3b0..bbce9159caa1 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -349,3 +349,4 @@
 432n64 fsmount sys_fsmount
 433n64 fspick  sys_fspick
 434n64 pidfd_open  sys_pidfd_open
+# 435 reserved for clone3
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl 
b/arch/mips/kernel/syscalls/syscall_o32.tbl
index dba084c92f14..9653591428ec 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -422,3 +422,4 @@
 432o32 fsmount sys_fsmount
 433o32 fspick  sys_fspick
 434o32 pidfd_open  sys_pidfd_open
+# 435 reserved for clone3
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl 
b/arch/parisc/kernel/syscalls/syscall.tbl
index 5022b9e179c2..c7aadfef5386 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -431,3 +431,4 @@
 432common  fsmount sys_fsmount
 433common  fspick  sys_fspick
 434common  pidfd_open  sys_pidfd_open
+# 435 reserved for clone3
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl 
b/arch/powerpc/kernel/syscalls/syscall.tbl
index f2c3bda2d39f..3331749aab20 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -516,3 +516,4 @@
 432common  fsmount sys_fsmount
 433common  fspick  sys_fspick
 434common  pidfd_open

Re: [PATCH 01/12] Documentation: move architectures together

2019-07-14 Thread Alex Shi
Hi Jon,

Thanks for quick response!

在 2019/7/13 上午1:34, Jonathan Corbet 写道:
> On Fri, 12 Jul 2019 10:20:07 +0800
> Alex Shi  wrote:
> 
>> There are many different archs in Documentation/ dir, it's better to
>> move them together in 'Documentation/arch' which follows from kernel source.
> 
> So this seems certain to collide badly with Mauro's RST-conversion monster
> patch set.

I don't mean that, sorry, and we can figure out if both of them are worthy to 
be picked up.

> 
> More to the point, though...if we are going to thrash up things this
> badly, we want to be sure that we're doing it right so we don't end up
> renaming everything again.  Grouping stuff into a new arch/ subdirectory
> adds a bit of order, but it doesn't do much toward trying to organize our
> documentation for its readers, and it doesn't help us to modernize the
> docs and get rid of the old, useless stuff.  A quick check shows that many
> of these files have seen no changes other than typo fixes since the
> beginning of the Git era.

Right, there is some docs which need to be update or even drop, and a reorder 
would be a timing to push each of arch maintainer to do something, isn't it? 
Anyway, reordering  documents like kernel source dir could be one of choices. :)

> 
> So, in my mind, this needs some thought.  Maybe we want a
> Documentation/arch in the end, but I'm not convinced that we should just
> create it and fill it with a snow shovel.  This might be a good thing to
> discuss at the kernel summit in September.

Thanks for considering, anyway, it could be a good start to get hands dirty 
whether the arch dir is needed. 

Thanks
Alex


Re: [PATCH v9 05/10] namei: O_BENEATH-style path resolution flags

2019-07-14 Thread Al Viro
On Sun, Jul 14, 2019 at 05:00:29PM +1000, Aleksa Sarai wrote:

> The basic property being guaranteed by LOOKUP_IN_ROOT is that it will
> not result in resolution of a path component which was not inside the
> root of the dirfd tree at some point during resolution (and that all
> absolute symlink and ".." resolution will be done relative to the
> dirfd). This may smell slightly of chroot(2), because unfortunately it
> is a similar concept -- the reason for this is to allow for a more
> efficient way to safely resolve paths inside a rootfs than spawning a
> separate process to then pass back the fd to the caller.

IDGI...  If attacker can modify your subtree, you have already lost -
after all, they can make anything appear inside that tree just before
your syscall is made and bring it back out immediately afterwards.
And if they can't, what is the race you are trying to protect against?
Confused...


[Bug 200055] WARNING: CPU: 0 PID: 1 at kernel/locking/lockdep.c:3214 .__lockdep_init_map+0x260/0x270

2019-07-14 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=200055

Erhard F. (erhar...@mailbox.org) changed:

   What|Removed |Added

 Attachment #281901|0   |1
is obsolete||

--- Comment #18 from Erhard F. (erhar...@mailbox.org) ---
Created attachment 283681
  --> https://bugzilla.kernel.org/attachment.cgi?id=283681=edit
4.14.132 kernel .config (G5 11,2)

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

[Bug 200055] WARNING: CPU: 0 PID: 1 at kernel/locking/lockdep.c:3214 .__lockdep_init_map+0x260/0x270

2019-07-14 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=200055

--- Comment #17 from Erhard F. (erhar...@mailbox.org) ---
Created attachment 283679
  --> https://bugzilla.kernel.org/attachment.cgi?id=283679=edit
4.9.184 kernel .config (G5 11,2)

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

[Bug 200055] WARNING: CPU: 0 PID: 1 at kernel/locking/lockdep.c:3214 .__lockdep_init_map+0x260/0x270

2019-07-14 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=200055

--- Comment #16 from Erhard F. (erhar...@mailbox.org) ---
Created attachment 283677
  --> https://bugzilla.kernel.org/attachment.cgi?id=283677=edit
dmesg.txt (G5 11,2 + kernel 4.14.132)

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

[Bug 200055] WARNING: CPU: 0 PID: 1 at kernel/locking/lockdep.c:3214 .__lockdep_init_map+0x260/0x270

2019-07-14 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=200055

--- Comment #15 from Erhard F. (erhar...@mailbox.org) ---
Created attachment 283675
  --> https://bugzilla.kernel.org/attachment.cgi?id=283675=edit
dmesg.txt (G5 11,2 + kernel 4.9.184)

By running some older kernels from the stable series I noticed that 4.4.x and
4.9.x are not affected by this lockdep issue. First stable series it shows up 
is 4.14.x.

I am rather curious what a bisect will throw up when I have time to do it. ;)

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

Re: [PATCH v9 05/10] namei: O_BENEATH-style path resolution flags

2019-07-14 Thread Aleksa Sarai
On 2019-07-12, Al Viro  wrote:
> On Fri, Jul 12, 2019 at 01:55:52PM +0100, Al Viro wrote:
> > On Fri, Jul 12, 2019 at 01:39:24PM +0100, Al Viro wrote:
> > > On Fri, Jul 12, 2019 at 08:57:45PM +1000, Aleksa Sarai wrote:
> > > 
> > > > > > @@ -2350,9 +2400,11 @@ static const char *path_init(struct 
> > > > > > nameidata *nd, unsigned flags)
> > > > > > s = ERR_PTR(error);
> > > > > > return s;
> > > > > > }
> > > > > > -   error = dirfd_path_init(nd);
> > > > > > -   if (unlikely(error))
> > > > > > -   return ERR_PTR(error);
> > > > > > +   if (likely(!nd->path.mnt)) {
> > > > > 
> > > > > Is that a weird way of saying "if we hadn't already called 
> > > > > dirfd_path_init()"?
> > > > 
> > > > Yes. I did it to be more consistent with the other "have we got the
> > > > root" checks elsewhere. Is there another way you'd prefer I do it?
> > > 
> > > "Have we got the root" checks are inevitable evil; here you are making the
> > > control flow in a single function hard to follow.
> > > 
> > > I *think* what you are doing is
> > >   absolute pathname, no LOOKUP_BENEATH:
> > >   set_root
> > >   error = nd_jump_root(nd)
> > >   else
> > >   error = dirfd_path_init(nd)
> > >   return unlikely(error) ? ERR_PTR(error) : s;
> > > which should be a lot easier to follow (not to mention shorter), but I 
> > > might
> > > be missing something in all of that.
> > 
> > PS: if that's what's going on, I would be tempted to turn the entire
> > path_init() part into this:
> > if (flags & LOOKUP_BENEATH)
> > while (*s == '/')
> > s++;
> > in the very beginning (plus the handling of nd_jump_root() prototype
> > change, but that belongs with nd_jump_root() change itself, obviously).
> > Again, I might be missing something here...
> 
> Argh... I am, at that - you have setting path->root (and grabbing it)
> in LOOKUP_BENEATH cases and you do it after dirfd_path_init().  So
> how about
>   if (flags & LOOKUP_BENEATH)
>   while (*s == '/')
>   s++;

I can do this for LOOKUP_IN_ROOT, but currently the semantics for
LOOKUP_BENEATH is that absolute paths will return -EXDEV
indiscriminately (nd_jump_root() errors out with LOOKUP_BENEATH). To be
honest, the check could actually just be:

  if (flags & LOOKUP_BENEATH)
if (*s == '/')
  return ERR_PTR(-EXDEV);

(Though we'd still need -EXDEV in nd_jump_root() for obvious reasons.)

The logic being that an absolute path means that the resolution starts
out without being "beneath" the starting point -- thus violating the
contract of LOOKUP_BENEATH. And since the "handle absolute paths like
they're scoped to the root" is only implemented for LOOKUP_IN_ROOT, I'd
think it's a bit odd to have LOOKUP_BENEATH do it too for absolute
paths.

I'll be honest, this patchset is more confusing to both of us because of
LOOKUP_BENEATH -- I've only kept it since it was part of the original
patchset (O_BENEATH). Personally I think more people will be far more
interested in LOOKUP_IN_ROOT. Does anyone mind if I drop the
LOOKUP_BENEATH parts of this series, and only keep LOOKUP_NO_* and
LOOKUP_IN_ROOT?

I make a change as you outlined for LOOKUP_IN_ROOT, though.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH



signature.asc
Description: PGP signature


Re: [PATCH v9 01/10] namei: obey trailing magic-link DAC permissions

2019-07-14 Thread Aleksa Sarai
On 2019-07-12, Al Viro  wrote:
> On Fri, Jul 12, 2019 at 10:20:17PM +1000, Aleksa Sarai wrote:
> > On 2019-07-12, Al Viro  wrote:
> > > On Sun, Jul 07, 2019 at 12:57:28AM +1000, Aleksa Sarai wrote:
> > > > @@ -514,7 +516,14 @@ static void set_nameidata(struct nameidata *p, int 
> > > > dfd, struct filename *name)
> > > > p->stack = p->internal;
> > > > p->dfd = dfd;
> > > > p->name = name;
> > > > -   p->total_link_count = old ? old->total_link_count : 0;
> > > > +   p->total_link_count = 0;
> > > > +   p->acc_mode = 0;
> > > > +   p->opath_mask = FMODE_PATH_READ | FMODE_PATH_WRITE;
> > > > +   if (old) {
> > > > +   p->total_link_count = old->total_link_count;
> > > > +   p->acc_mode = old->acc_mode;
> > > > +   p->opath_mask = old->opath_mask;
> > > > +   }
> > > 
> > > Huh?  Could somebody explain why traversals of NFS4 referrals should 
> > > inherit
> > > ->acc_mode and ->opath_mask?
> > 
> > I'll be honest -- I don't understand what set_nameidata() did so I just
> > did what I thought would be an obvious change (to just copy the
> > contents). I thought it was related to some aspect of the symlink stack
> > handling.
> 
> No.  It's handling of (very rare) nested pathwalk.  The only case I can think
> of is handling of NFS4 referrals - they are triggered by ->d_automount()
> and include NFS4 mount.  Which does internal pathwalk of its own, to get
> to the root of subtree being automounted.
> 
> NFS has its own recursion protection on that path (no deeper nesting than
> one level of referral traversals), but there some nesting is inevitable;
> we do get another nameidata instance on stack.  And for nd_jump_link() we
> need to keep track of the innermost one.
> 
> For symlinks nothing of that sort happens - they are dealt with on the same
> struct nameidata.  ->total_link_count copying is there for one reason only -
> we want the total amount of symlinks traversed during the pathwalk (including
> the referral processing, etc.) to count towards MAXSYMLINKS check.  It 
> could've
> been moved from nameidata to task_struct, but it's cheaper to handle it that
> way.
> 
> Again, nesting is *rare*.

Thanks for the explanation, much appreciated. I will drop the old->...
copying hunk.

> > In that case, should they both be set to 0 on set_nameidata()? This will
> > mean that fd re-opening (or magic-link opening) through a
> > set_nameidata() would always fail.
> 
> Huh?  set_nameidata() is done for *all* instances - it's pretty much the
> constructor of that object (and restore_nameidata() - a destructor).
> Everything goes through it.

Sorry, I meant to drop the copy-from-old logic -- not set it to zero
explicitly in set_nameidata().

> And again, I'm not sure we want these fields in nameidata - IMO they belong
> in open_flags.  Things like e.g. stat() don't need them at all.

Yup, I'll work up a version that does the consolidation you mentioned
in your other mail.

> Incidentally, O_PATH opening of symlinks combined with subsequent procfs
> symlink traversals is worth testing - that's where the things get subtle
> and that's where it's easy to get in trouble on modifications.

I have some self-tests of a symlink-to-a-magic-link in the last patch of
the series. Did you mean something even more chained like a symlink to a
/proc/self/fd/$n of an O_NOFOLLOW|O_PATH of a symlink?

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH



signature.asc
Description: PGP signature


Re: [PATCH v9 05/10] namei: O_BENEATH-style path resolution flags

2019-07-14 Thread Aleksa Sarai
On 2019-07-13, Al Viro  wrote:
> On Fri, Jul 12, 2019 at 04:00:26PM +0100, Al Viro wrote:
> > On Fri, Jul 12, 2019 at 02:25:53PM +0100, Al Viro wrote:
> > 
> > >   if (flags & LOOKUP_BENEATH) {
> > >   nd->root = nd->path;
> > >   if (!(flags & LOOKUP_RCU))
> > >   path_get(>root);
> > >   else
> > >   nd->root_seq = nd->seq;
> > 
> > BTW, this assignment is needed for LOOKUP_RCU case.  Without it
> > you are pretty much guaranteed that lazy pathwalk will fail,
> > when it comes to complete_walk().
> > 
> > Speaking of which, what would happen if LOOKUP_ROOT/LOOKUP_BENEATH
> > combination would someday get passed?
> 
> I don't understand what's going on with ->r_seq in there - your
> call of path_is_under() is after having (re-)sampled rename_lock,
> but if that was the only .. in there, who's going to recheck
> the value?  For that matter, what's to guarantee that the thing
> won't get moved just as you are returning from handle_dots()?
> 
> IOW, what does LOOKUP_IN_ROOT guarantee for caller (openat2())?

I tried to explain this in the commit message for "namei: aggressively
check for nd->root escape on ".." resolution", but I probably could've
explained it better.

The basic property being guaranteed by LOOKUP_IN_ROOT is that it will
not result in resolution of a path component which was not inside the
root of the dirfd tree at some point during resolution (and that all
absolute symlink and ".." resolution will be done relative to the
dirfd). This may smell slightly of chroot(2), because unfortunately it
is a similar concept -- the reason for this is to allow for a more
efficient way to safely resolve paths inside a rootfs than spawning a
separate process to then pass back the fd to the caller.

We don't want to do a path_is_under() check for every ".." (otherwise
lookups have a quadratic slowdown when doing many ".."s), so I instead
only do a check after a rename or a mount (which are the only operations
which could change what ".." points to). And since we do the
path_is_under() check if m_seq or r_seq need a retry, we can re-take
them[+].

The main reason why I don't re-check path_is_under() after handle_dots()
is that there is no way to be sure that a racing rename didn't happen
after your last path_is_under() check. The rename could happen after the
syscall returns, after all.

So, the main purpose of the check is to make sure that a ".."s after a
rename doesn't result in an escape. If the rename happens after we've
traversed through a ".." that means that the ".." was inside the root in
the first place (a root ".." is handled by follow_dotdot). If the rename
happens after we've gone through handle_dots() and there is no
subsequent ".." then to userspace it looks identical to the rename
occurring after the syscall has returned. If there is a subsequent ".."
after a racing rename then we may have moved into a path that wasn't
path_is_under() and so we have to check it.

The only way I could see you could solve the race completely is if you
had a way for userspace to lock things from being able to be renamed (or
MS_MOVE'd). And that feels like a really bad idea to me.

[+]: You asked why don't I re-take m_seq. The reason is that I don't
 fully understand all the other m_seq checks being done during
 resolution, and we aren't definitely doing them all in
 handle_dots(). So I assumed re-taking it could result in me
 breaking RCU-walk which obviously would be bad. Since I am the only
 thing using nd->r_seq, I can re-take it without issue.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH



signature.asc
Description: PGP signature