Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

2024-05-31 Thread Oliver Upton
On Fri, May 31, 2024 at 02:31:17PM -0600, Yu Zhao wrote:
> On Fri, May 31, 2024 at 1:24 AM Oliver Upton  wrote:

[...]

> > Grabbing the MMU lock for write to scan sucks, no argument there. But
> > can you please be specific about the impact of read lock v. RCU in the
> > case of arm64? I had asked about this before and you never replied.
> >
> > My concern remains that adding support for software table walkers
> > outside of the MMU lock entirely requires more work than just deferring
> > the deallocation to an RCU callback. Walkers that previously assumed
> > 'exclusive' access while holding the MMU lock for write must now cope
> > with volatile PTEs.
> >
> > Yes, this problem already exists when hardware sets the AF, but the
> > lock-free walker implementation needs to be generic so it can be applied
> > for other PTE bits.
> 
> Direct reclaim is multi-threaded and each reclaimer can take the mmu
> lock for read (testing the A-bit) or write (unmapping before paging
> out) on arm64. The fundamental problem of using the readers-writer
> lock in this case is priority inversion: the readers have lower
> priority than the writers, so ideally, we don't want the readers to
> block the writers at all.

So we already have this sort of problem of stage-2 fault handling v.
secondary MMU invalidations, which is why I've been doubtful of the
perceived issue. In fact, I'd argue that needing to wait for faults is
worse than aging participation since those can be trivially influenced
by userspace/guest.

In any case, we shouldn't ever be starved since younger readers cannot
enter the critical section with a pending writer.

> As I said earlier, I prefer we drop the arm64 support for now, but I
> will not object to taking the mmu lock for read when clearing the
> A-bit, as long as we fully understand the problem here and document it
> clearly.

I'd be convinced of this if there's data that shows read lock
acquisition is in fact consequential. Otherwise, I'm not sure the added
complexity of RCU table walkers (per my statement above) is worth the
effort / maintenance burden.

-- 
Thanks,
Oliver


Re: [PATCH v4 6/7] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn

2024-05-31 Thread Oliver Upton
On Fri, May 31, 2024 at 12:11:33PM -0700, Oliver Upton wrote:
> On Wed, May 29, 2024 at 06:05:09PM +, James Houghton wrote:
> 
> [...]
> 
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 9e2bbee77491..eabb07c66a07 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -1319,10 +1319,8 @@ static int stage2_age_walker(const struct 
> > kvm_pgtable_visit_ctx *ctx,
> > data->young = true;
> >  
> > /*
> > -* stage2_age_walker() is always called while holding the MMU lock for
> > -* write, so this will always succeed. Nonetheless, this deliberately
> > -* follows the race detection pattern of the other stage-2 walkers in
> > -* case the locking mechanics of the MMU notifiers is ever changed.
> > +* This walk may not be exclusive; the PTE is permitted to change
> > +* from under us.
> >  */
> > if (data->mkold && !stage2_try_set_pte(ctx, new))
> > return -EAGAIN;
> 
> It is probably worth mentioning that if there was a race to update the
> PTE then the GFN is most likely young, so failing to clear AF probably
> isn't even consequential.

Oh, and the WARN_ON() in kvm_pgtable_stage2_test_clear_young() is bogus
now. Maybe demote it to:

  r = kvm_pgtable_walk(...);
  WARN_ON_ONCE(r && r != -EAGAIN);

-- 
Thanks,
Oliver


Re: [PATCH v4 6/7] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn

2024-05-31 Thread Oliver Upton
On Wed, May 29, 2024 at 06:05:09PM +, James Houghton wrote:

[...]

> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 9e2bbee77491..eabb07c66a07 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1319,10 +1319,8 @@ static int stage2_age_walker(const struct 
> kvm_pgtable_visit_ctx *ctx,
>   data->young = true;
>  
>   /*
> -  * stage2_age_walker() is always called while holding the MMU lock for
> -  * write, so this will always succeed. Nonetheless, this deliberately
> -  * follows the race detection pattern of the other stage-2 walkers in
> -  * case the locking mechanics of the MMU notifiers is ever changed.
> +  * This walk may not be exclusive; the PTE is permitted to change
> +  * from under us.
>*/
>   if (data->mkold && !stage2_try_set_pte(ctx, new))
>   return -EAGAIN;

It is probably worth mentioning that if there was a race to update the
PTE then the GFN is most likely young, so failing to clear AF probably
isn't even consequential.

-- 
Thanks,
Oliver


Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

2024-05-31 Thread Oliver Upton
On Fri, May 31, 2024 at 10:45:04AM -0600, Yu Zhao wrote:
> On Fri, May 31, 2024 at 1:03 AM Oliver Upton  wrote:
> >
> > On Fri, May 31, 2024 at 12:05:48AM -0600, Yu Zhao wrote:
> 
> Let me add back what I said earlier:
> 
>   I'm not convinced, but it doesn't mean your point of view is
>   invalid. If you fully understand the implications of your design
>   choice and document them, I will not object.

Thanks, I appreciate the sentiment. Hopefully we can align here.

> > > All optimizations in v2 were measured step by step. Even that bitmap,
> > > which might be considered overengineered, brought a readily
> > > measuarable 4% improvement in memcached throughput on Altra Max
> > > swapping to Optane:
> >
> > That's great, but taking an iterative approach to the problem allows
> > the reviewers and maintainers to come to their own conclusions about
> > each optimization independently. Squashing all of that together and
> > posting the result doesn't allow for this.
> 
> That's your methodology, which I respect: as I said I won't stand in your way.
> 
> But mine is backed by data, please do respect that as well,

Data is useful and expected for changes that aim to improve the
performance of a system in one way or another. That is, after all, the
sole intention of the work, no?

What I'm also looking for is a controlled experiment, where a single
independent variable (e.g. locking) can be evaluated against the
baseline. All-or-nothing data has limited usefulness.

> by doing what I asked: document your justifications.

The justification for a series is against the upstream tree, not some
out-of-tree stuff. The cover letter explicitly calls out the decision
to simplify the patch series along with performance data I can reproduce
on my own systems.

This is a perfect example of how to contribute changes upstream.

> > > What I don't think is acceptable is simplifying those optimizations
> > > out without documenting your justifications (I would even call it a
> > > design change, rather than simplification, from v3 to v4).
> >
> > No, sorry, there's nothing wrong with James' approach here.
> 
> Sorry, are you saying "without documenting your justifications" is
> nothing wrong? If so, please elaborate.

As I mentioned above, the reasoning is adequately documented and the
discussion that led to v4 is public. OTOH...

> > The discussion that led to the design of v4 happened on list; you were
> > on CC. The general consensus on the KVM side was that the bitmap was
> > complicated and lacked independent justification. There was ample
> > opportunity to voice your concerns before he spent the time on v4.
> 
> Please re-read my previous emails -- I never object to the removal of
> the bitmap or James' approach.
> 
> And please stop making assumptions -- I did voice my concerns with
> James privately.
^

If it happened in private then its no better than having said nothing at
all.

Please, keep the conversation on-list next time so we can iron out any
disagreements there. Otherwise contributors are put in a *very* awkward
situation of mediating the on- and off-list dialogue.

> > You seriously cannot fault a contributor for respinning their work based
> > on the provided feedback.
> 
> Are you saying I faulted James for taking others' feedback?

No. Sufficient justification is captured in the public review feedback +
series cover letter. Your statement that the approach was changed without
justification is unsubstantiated.

> Also what do you think about the technical flaws and inaccurate
> understandings I pointed out? You seem to have a strong opinion on
> your iterate approach, but I hope you didn't choose to overlook the
> real meat of this discussion.

Serious question: are you not receiving my mail or something?

I re-raised my question to you from ages ago about locking on the arm64
MMU. You didn't answer last time, I'd appreciate a reply this time
around.

Otherwise I couldn't be bothered about the color of the Kconfig bikeshed
and don't have anything meaningful to add there. I think the three of
you are trending in the right direction.

-- 
Thanks,
Oliver


Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

2024-05-31 Thread Oliver Upton
On Wed, May 29, 2024 at 03:03:21PM -0600, Yu Zhao wrote:
> On Wed, May 29, 2024 at 12:05 PM James Houghton  wrote:
> >
> > Secondary MMUs are currently consulted for access/age information at
> > eviction time, but before then, we don't get accurate age information.
> > That is, pages that are mostly accessed through a secondary MMU (like
> > guest memory, used by KVM) will always just proceed down to the oldest
> > generation, and then at eviction time, if KVM reports the page to be
> > young, the page will be activated/promoted back to the youngest
> > generation.
> 
> Correct, and as I explained offline, this is the only reasonable
> behavior if we can't locklessly walk secondary MMUs.
> 
> Just for the record, the (crude) analogy I used was:
> Imagine a large room with many bills ($1, $5, $10, ...) on the floor,
> but you are only allowed to pick up 10 of them (and put them in your
> pocket). A smart move would be to survey the room *first and then*
> pick up the largest ones. But if you are carrying a 500 lbs backpack,
> you would just want to pick up whichever that's in front of you rather
> than walk the entire room.
> 
> MGLRU should only scan (or lookaround) secondary MMUs if it can be
> done lockless. Otherwise, it should just fall back to the existing
> approach, which existed in previous versions but is removed in this
> version.

Grabbing the MMU lock for write to scan sucks, no argument there. But
can you please be specific about the impact of read lock v. RCU in the
case of arm64? I had asked about this before and you never replied.

My concern remains that adding support for software table walkers
outside of the MMU lock entirely requires more work than just deferring
the deallocation to an RCU callback. Walkers that previously assumed
'exclusive' access while holding the MMU lock for write must now cope
with volatile PTEs.

Yes, this problem already exists when hardware sets the AF, but the
lock-free walker implementation needs to be generic so it can be applied
for other PTE bits.

-- 
Thanks,
Oliver


Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

2024-05-31 Thread Oliver Upton
On Fri, May 31, 2024 at 12:05:48AM -0600, Yu Zhao wrote:

[...]

> All optimizations in v2 were measured step by step. Even that bitmap,
> which might be considered overengineered, brought a readily
> measuarable 4% improvement in memcached throughput on Altra Max
> swapping to Optane:

That's great, but taking an iterative approach to the problem allows
the reviewers and maintainers to come to their own conclusions about
each optimization independently. Squashing all of that together and
posting the result doesn't allow for this.

Even if we were to take the series as-is, the door is wide open to
subsequent improvements.

> What I don't think is acceptable is simplifying those optimizations
> out without documenting your justifications (I would even call it a
> design change, rather than simplification, from v3 to v4).

No, sorry, there's nothing wrong with James' approach here.

The discussion that led to the design of v4 happened on list; you were
on CC. The general consensus on the KVM side was that the bitmap was
complicated and lacked independent justification. There was ample
opportunity to voice your concerns before he spent the time on v4.

You seriously cannot fault a contributor for respinning their work based
on the provided feedback.

-- 
Thanks,
Oliver


Re: [PATCH v2 1/6] KVM: Add a flag to track if a loaded vCPU is scheduled out

2024-05-22 Thread Oliver Upton
On Tue, May 21, 2024 at 06:40:08PM -0700, Sean Christopherson wrote:
> Add a kvm_vcpu.scheduled_out flag to track if a vCPU is in the process of
> being scheduled out (vCPU put path), or if the vCPU is being reloaded
> after being scheduled out (vCPU load path).  In the short term, this will
> allow dropping kvm_arch_sched_in(), as arch code can query scheduled_out
> during kvm_arch_vcpu_load().
> 
> Longer term, scheduled_out opens up other potential optimizations, without
> creating subtle/brittle dependencies.  E.g. it allows KVM to keep guest
> state (that is managed via kvm_arch_vcpu_{load,put}()) loaded across
> kvm_sched_{out,in}(), if KVM knows the state isn't accessed by the host
> kernel.  Forcing arch code to coordinate between kvm_arch_sched_{in,out}()
> and kvm_arch_vcpu_{load,put}() is awkward, not reusable, and relies on the
> exact ordering of calls into arch code.
> 
> Adding scheduled_out also obviates the need for a kvm_arch_sched_out()
> hook, e.g. if arch code needs to do something novel when putting vCPU
> state.
> 
> And even if KVM never uses scheduled_out for anything beyond dropping
> kvm_arch_sched_in(), just being able to remove all of the arch stubs makes
> it worth adding the flag.
> 
> Link: https://lore.kernel.org/all/20240430224431.490139-1-sea...@google.com
> Cc: Oliver Upton 
> Signed-off-by: Sean Christopherson 

Reviewed-by: Oliver Upton 

-- 
Thanks,
Oliver


Re: [PATCH v2 2/2] powerpc: hotplug driver bridge support

2024-05-20 Thread Oliver O'Halloran
On Fri, May 17, 2024 at 9:15 PM krishna kumar  wrote:
>
> > Uh, if I'm reading this right it looks like your "slot" C5 is actually
> > the PCIe switch's internal bus which is definitely not hot pluggable.
>
> It's a hotplug slot. Please see the snippet below:
>
> :~$ sudo lspci -vvv -s 0004:02:00.0 | grep --color HotPlug
>  SltCap:AttnBtn- PwrCtrl+ MRL- AttnInd- PwrInd- HotPlug+ Surprise-
> :~$
>
> :~$ sudo lspci -vvv -s 0004:02:01.0 | grep --color HotPlug
> SltCap:AttnBtn- PwrCtrl+ MRL- AttnInd- PwrInd- HotPlug+ Surprise-
> :~$
>
> :~$ sudo lspci -vvv -s 0004:02:02.0 | grep --color HotPlug
> SltCap:AttnBtn- PwrCtrl+ MRL- AttnInd- PwrInd- HotPlug+ Surprise-
> :~$
>
> :~$ sudo lspci -vvv -s 0004:02:03.0 | grep --color HotPlug
> SltCap:AttnBtn- PwrCtrl+ MRL- AttnInd- PwrInd- HotPlug+ Surprise-
> :~$

All this is showing is that the switch downstream ports on bus 0004:02
have a slot capability. I already know that (see what I said
previously about physical links). The fact the downstream ports have a
slot capability also has absolutely nothing to do with anything I was
saying. Look at the lspci output for 0004:01:00.0 which is the
switch's upstream port. The upstream port device will not have a slot
capability because it's a bridge into the virtual PCI bus that is
internal to the switch.

> It seems like your explanation about the missing 0004:01:00.0 may be
> correct and could be due to a firmware bug. However, the scope of this
> patch does not relate to this issue. Additionally, if it starts with
> 0004:01:00.0 to 0004:01:03.0, the behavior of hot-unplug and hot-plug
> operations will remain inconsistent. This patch aims to address the
> inconsistent behavior of hot-unplug and hot-plug.
>
> *snip*
>
> > It might be worth adding some logic to pnv_php to verify the PCI
> > bridge upstream of the slot actually has the PCIe slot capability to
> > guard against this problem.
>
> We can have a look at this problem in another patch.

The point of this series is to fix the behaviour of pnv_php, is it
not? Powering off a PCI(e) slot is supposed to render it safe to
remove the card  in that slot. Currently if you "power off" C5, the
kernel is still going to have active references to the switch's
upstream port device (at 0004:01:00.0) and the switch management
function (at 0004:01:00.1). If the kernel has active references to PCI
devices physically located in the slot we supposedly powered off, then
the hotplug driver isn't doing its job. The asymmetry between hot add
and removal that you're trying to fix here is a side effect of the
fact that pnv_php is advertising the wrong thing as a slot. I think
you should stop pnv_php from advertising something as a slot when it's
not actually a slot because that's the root of all your problems.

> We wanted to handle the more generic case and did not want to be confined to
> only one device assumption. We want to fix the current inconsistent behavior
> more generically.

Right, as I said above I don't think handing the more generic case is
actually required if pnv_php is doing its job properly. It doesn't
hurt though.

> Regarding the fix, the fix is obvious:

really?

> We have to traverse
> and find the bridge ports from DT and invoke  pci_scan_slot() on them. This 
> will
> discover and create the entry for bridge ports (0004:02:00.0 to 0004:02:00.3 
> on
> the given bus- 0004:02). There is already an existing function, 
> pci_scan_bridge()
> which is doing invocation of pci_scan_slot () for the devices behind the 
> bridge,
> in this case for  SAS device. So eventually, we are doing a scan of all the 
> entities
> behind the slot.

I already read your patch so I'm not sure why you feel the need to
re-describe it in tedious detail.

> Would you like me to combine the non-bridge and bridge cases into one? I can 
> attempt
> to do this. Hopefully, if we incorporate the iterate sibling logic case 
> correctly,
> we may not need to maintain these two separate cases for bridge and 
> non-bridge. I
> will attempt this, and if it works, I will include it in the next patch. 
> Thanks.

Yes, do that.

Also, do not post HTML emails to linux development lists. It breaks
plain text inline quoting which makes your messages annoying to reply
to. Some linux development lists will also silently drop HTML emails.
Please talk to the other LTC engineers about how to set up your mail
client to send plain text emails to avoid these problems in the
future.

Oliver


Re: [PATCH v2 2/2] powerpc: hotplug driver bridge support

2024-05-16 Thread Oliver O'Halloran
On Tue, May 14, 2024 at 11:54 PM Krishna Kumar  wrote:
>
> There is an issue with the hotplug operation when it's done on the
> bridge/switch slot. The bridge-port and devices behind the bridge, which
> become offline by hot-unplug operation, don't get hot-plugged/enabled by
> doing hot-plug operation on that slot. Only the first port of the bridge
> gets enabled and the remaining port/devices remain unplugged. The hot
> plug/unplug operation is done by the hotplug driver
> (drivers/pci/hotplug/pnv_php.c).
>
> Root Cause Analysis: This behavior is due to missing code for the DPC
> switch/bridge.

I don't see anything touching DPC in this series?

> *snip*
>
> Command for reproducing the issue :
>
> For hot unplug/disable - echo 0 > /sys/bus/pci/slots/C5/power
> For hot plug/enable -echo 1 > /sys/bus/pci/slots/C5/power
>
> where C5 is slot associated with bridge.
>
> Scenario/Tests:
> Output of lspci -nn before test is given below. This snippet contains
> devices used for testing on Powernv machine.
>
> 0004:02:00.0 PCI bridge [0604]: PMC-Sierra Inc. Device [11f8:4052]
> 0004:02:01.0 PCI bridge [0604]: PMC-Sierra Inc. Device [11f8:4052]
> 0004:02:02.0 PCI bridge [0604]: PMC-Sierra Inc. Device [11f8:4052]
> 0004:02:03.0 PCI bridge [0604]: PMC-Sierra Inc. Device [11f8:4052]
> 0004:08:00.0 Serial Attached SCSI controller [0107]:
> Broadcom / LSI SAS3216 PCI-Express Fusion-MPT SAS-3 [1000:00c9] (rev 01)
> 0004:09:00.0 Serial Attached SCSI controller [0107]:
> Broadcom / LSI SAS3216 PCI-Express Fusion-MPT SAS-3 [1000:00c9] (rev 01)
>
> Output of lspci -tv before test is as follows:
>
> # lspci -tv
>  +-[0004:00]---00.0-[01-0e]--+-00.0-[02-0e]--+-00.0-[03-07]--
>  |   |   +-01.0-[08]00.0  Broadcom / 
> LSI SAS3216 PCI-Express Fusion-MPT SAS-3
>  |   |   +-02.0-[09]00.0  Broadcom / 
> LSI SAS3216 PCI-Express Fusion-MPT SAS-3
>  |   |   \-03.0-[0a-0e]--
>  |   \-00.1  PMC-Sierra Inc. Device 4052
>
> C5(bridge) and C6(End Point) slot address are as below:
> # cat /sys/bus/pci/slots/C5/address
> 0004:02:00
> # cat /sys/bus/pci/slots/C6/address
> 0004:09:00

Uh, if I'm reading this right it looks like your "slot" C5 is actually
the PCIe switch's internal bus which is definitely not hot pluggable.
I find it helps to look at the PCI topology in terms of where the
physical PCIe links are. Here we've got:

- A link between the PHB (0004:00:00.0) and the switch upstream port
(0004:01:00.0)
- A link from switch downstream port 0 (0004:02:00.0) to nothing
- A link from switch downstream port 1 (0004:02:01.0) to a SAS card
- A link from switch downstream port 2 (0004:02:02.0) to a SAS card
- A link from switch downstream port 2 (0004:02:03.0) to nothing

Note that there's no PCIe link between the switch upstream port
(0004:01:00.0) and the downstream ports on bus 0004:02. The connection
between those is invisible to us because it's custom bus logic
internal to the PCIe switch ASIC. What I think has happened here is
that system firmware has supplied bad PCIe slot information to OPAL
which has resulted in pnv_php advertising a slot in the wrong place.
Assuming this following the usual IBM convention I'd expect the bridge
device for C5 to be the PHB's root port and the bus should be 0004:01.
It might be worth adding some logic to pnv_php to verify the PCI
bridge upstream of the slot actually has the PCIe slot capability to
guard against this problem.

> Hot-unplug operation on slot associated with bridge:
> # echo 0 > /sys/bus/pci/slots/C5/power
> # lspci -tv
>  +-[0004:00]---00.0-[01-0e]--+-00.0-[02-0e]--
>  |   \-00.1  PMC-Sierra Inc. Device 4052

Yep, "powering off" C5 doesn't remove the upstream port device. This
would create problems if you physically removed the card from C5 since
the kernel would assume the switch device is still present.

> *snip*


> diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
> index 38561d6a2079..bea612759832 100644
> --- a/arch/powerpc/kernel/pci_dn.c
> +++ b/arch/powerpc/kernel/pci_dn.c
> @@ -493,4 +493,36 @@ static void pci_dev_pdn_setup(struct pci_dev *pdev)
> pdn = pci_get_pdn(pdev);
> pdev->dev.archdata.pci_data = pdn;
>  }
> +
> +void pci_traverse_sibling_nodes_and_scan_slot(struct device_node *start, 
> struct pci_bus *bus)
> +{
> +   struct device_node *dn;
> +   int slotno;
> +
> +   u32 class = 0;
> +
> +   if (!of_property_read_u32(start->child, "class-code", )) {
> +   /* Call of pci_scan_slot for non-bridge/EP case */
> +   if (!((class >> 8) == PCI_CLASS_BRIDGE_PCI)) {
> +   slotno = PCI_SLOT(PCI_DN(start->child)->devfn);
> +   pci_scan_slot(bus, PCI_DEVFN(slotno, 0));
> +   return;
> +   }
> +   }
> +
> +   /* Iterate all siblings */
> +   

Re: [PATCH 0/4] KVM: Fold kvm_arch_sched_in() into kvm_arch_vcpu_load()

2024-05-01 Thread Oliver Upton
On Wed, May 01, 2024 at 07:28:21AM -0700, Sean Christopherson wrote:
> On Wed, May 01, 2024, Oliver Upton wrote:
> > On Tue, Apr 30, 2024 at 12:31:53PM -0700, Sean Christopherson wrote:
> > > Drop kvm_arch_sched_in() and instead pass a @sched_in boolean to
> > > kvm_arch_vcpu_load().
> > > 
> > > While fiddling with an idea for optimizing state management on AMD CPUs,
> > > I wanted to skip re-saving certain host state when a vCPU is scheduled 
> > > back
> > > in, as the state (theoretically) shouldn't change for the task while it's
> > > scheduled out.  Actually doing that was annoying and unnecessarily brittle
> > > due to having a separate API for the kvm_sched_in() case (the state save
> > > needed to be in kvm_arch_vcpu_load() for the common path).
> > > 
> > > E.g. I could have set a "temporary"-ish flag somewhere in kvm_vcpu, but 
> > > (a)
> > > that's gross and (b) it would rely on the arbitrary ordering between
> > > sched_in() and vcpu_load() staying the same.
> > 
> > Another option would be to change the rules around kvm_arch_sched_in()
> > where the callee is expected to load the vCPU context.
> > 
> > The default implementation could just call kvm_arch_vcpu_load() directly
> > and the x86 implementation can order things the way it wants before
> > kvm_arch_vcpu_load().
> > 
> > I say this because ...
> > 
> > > The only real downside I see is that arm64 and riscv end up having to pass
> > > "false" for their direct usage of kvm_arch_vcpu_load(), and passing 
> > > boolean
> > > literals isn't ideal.  But that can be solved by adding an inner helper 
> > > that
> > > omits the @sched_in param (I almost added a patch to do that, but I 
> > > couldn't
> > > convince myself it was necessary).
> > 
> > Needing to pass @sched_in for other usage of kvm_arch_vcpu_load() hurts
> > readability, especially when no other architecture besides x86 cares
> > about it.
> 
> Yeah, that bothers me too.
> 
> I tried your suggestion of having x86's kvm_arch_sched_in() do 
> kvm_arch_vcpu_load(),
> and even with an added kvm_arch_sched_out() to provide symmetry, the x86 code 
> is
> kludgy, and even the common code is a bit confusing as it's not super obvious
> that kvm_sched_{in,out}() is really just kvm_arch_vcpu_{load,put}().
> 
> Staring a bit more at the vCPU flags we have, adding a "bool scheduled_out" 
> isn't
> terribly gross if it's done in common code and persists across load() and 
> put(),
> i.e. isn't so blatantly a temporary field.  And because it's easy, it could be
> set with WRITE_ONCE() so that if it can be read cross-task if there's ever a
> reason to do so.
> 
> The x86 code ends up being less ugly, and adding future arch/vendor code for
> sched_in() *or* sched_out() requires minimal churn, e.g. arch code doesn't 
> need
> to override kvm_arch_sched_in().
> 
> The only weird part is that vcpu->preempted and vcpu->ready have slightly
> different behavior, as they are cleared before kvm_arch_vcpu_load().  But the
> weirdness is really with those flags no having symmetry, not with 
> scheduled_out
> itself.
> 
> Thoughts?

Yeah, this seems reasonable. Perhaps scheduled_out could be a nice hint
for guardrails / sanity checks in the future.

-- 
Thanks,
Oliver


Re: [PATCH 0/4] KVM: Fold kvm_arch_sched_in() into kvm_arch_vcpu_load()

2024-04-30 Thread Oliver Upton
On Tue, Apr 30, 2024 at 12:31:53PM -0700, Sean Christopherson wrote:
> Drop kvm_arch_sched_in() and instead pass a @sched_in boolean to
> kvm_arch_vcpu_load().
> 
> While fiddling with an idea for optimizing state management on AMD CPUs,
> I wanted to skip re-saving certain host state when a vCPU is scheduled back
> in, as the state (theoretically) shouldn't change for the task while it's
> scheduled out.  Actually doing that was annoying and unnecessarily brittle
> due to having a separate API for the kvm_sched_in() case (the state save
> needed to be in kvm_arch_vcpu_load() for the common path).
> 
> E.g. I could have set a "temporary"-ish flag somewhere in kvm_vcpu, but (a)
> that's gross and (b) it would rely on the arbitrary ordering between
> sched_in() and vcpu_load() staying the same.

Another option would be to change the rules around kvm_arch_sched_in()
where the callee is expected to load the vCPU context.

The default implementation could just call kvm_arch_vcpu_load() directly
and the x86 implementation can order things the way it wants before
kvm_arch_vcpu_load().

I say this because ...

> The only real downside I see is that arm64 and riscv end up having to pass
> "false" for their direct usage of kvm_arch_vcpu_load(), and passing boolean
> literals isn't ideal.  But that can be solved by adding an inner helper that
> omits the @sched_in param (I almost added a patch to do that, but I couldn't
> convince myself it was necessary).

Needing to pass @sched_in for other usage of kvm_arch_vcpu_load() hurts
readability, especially when no other architecture besides x86 cares
about it.

-- 
Thanks,
Oliver


Re: [PATCH] KVM: Get rid of return value from kvm_arch_create_vm_debugfs()

2024-02-23 Thread Oliver Upton
On Fri, 16 Feb 2024 15:59:41 +, Oliver Upton wrote:
> The general expectation with debugfs is that any initialization failure
> is nonfatal. Nevertheless, kvm_arch_create_vm_debugfs() allows
> implementations to return an error and kvm_create_vm_debugfs() allows
> that to fail VM creation.
> 
> Change to a void return to discourage architectures from making debugfs
> failures fatal for the VM. Seems like everyone already had the right
> idea, as all implementations already return 0 unconditionally.
> 
> [...]

Applied to kvmarm/next, thanks!

[1/1] KVM: Get rid of return value from kvm_arch_create_vm_debugfs()
  https://git.kernel.org/kvmarm/kvmarm/c/284851ee5cae

--
Best,
Oliver


[PATCH] KVM: Get rid of return value from kvm_arch_create_vm_debugfs()

2024-02-16 Thread Oliver Upton
The general expectation with debugfs is that any initialization failure
is nonfatal. Nevertheless, kvm_arch_create_vm_debugfs() allows
implementations to return an error and kvm_create_vm_debugfs() allows
that to fail VM creation.

Change to a void return to discourage architectures from making debugfs
failures fatal for the VM. Seems like everyone already had the right
idea, as all implementations already return 0 unconditionally.

Signed-off-by: Oliver Upton 
---

Compile-tested on arm64, powerpc, and x86 since I don't trust myself to
even get this simple patch right.

 arch/powerpc/kvm/powerpc.c | 3 +--
 arch/x86/kvm/debugfs.c | 3 +--
 include/linux/kvm_host.h   | 2 +-
 virt/kvm/kvm_main.c| 8 ++--
 4 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 23407fbd73c9..d32abe7fe6ab 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -2538,9 +2538,8 @@ void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, 
struct dentry *debugfs_
vcpu->kvm->arch.kvm_ops->create_vcpu_debugfs(vcpu, 
debugfs_dentry);
 }
 
-int kvm_arch_create_vm_debugfs(struct kvm *kvm)
+void kvm_arch_create_vm_debugfs(struct kvm *kvm)
 {
if (kvm->arch.kvm_ops->create_vm_debugfs)
kvm->arch.kvm_ops->create_vm_debugfs(kvm);
-   return 0;
 }
diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c
index 95ea1a1f7403..999227fc7c66 100644
--- a/arch/x86/kvm/debugfs.c
+++ b/arch/x86/kvm/debugfs.c
@@ -189,9 +189,8 @@ static const struct file_operations mmu_rmaps_stat_fops = {
.release= kvm_mmu_rmaps_stat_release,
 };
 
-int kvm_arch_create_vm_debugfs(struct kvm *kvm)
+void kvm_arch_create_vm_debugfs(struct kvm *kvm)
 {
debugfs_create_file("mmu_rmaps_stat", 0644, kvm->debugfs_dentry, kvm,
_rmaps_stat_fops);
-   return 0;
 }
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7e7fd25b09b3..9a45f673f687 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1507,7 +1507,7 @@ bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu);
 bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_post_init_vm(struct kvm *kvm);
 void kvm_arch_pre_destroy_vm(struct kvm *kvm);
-int kvm_arch_create_vm_debugfs(struct kvm *kvm);
+void kvm_arch_create_vm_debugfs(struct kvm *kvm);
 
 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 10bfc88a69f7..c681149c382a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1150,10 +1150,7 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, const 
char *fdname)
_fops_per_vm);
}
 
-   ret = kvm_arch_create_vm_debugfs(kvm);
-   if (ret)
-   goto out_err;
-
+   kvm_arch_create_vm_debugfs(kvm);
return 0;
 out_err:
kvm_destroy_vm_debugfs(kvm);
@@ -1183,9 +1180,8 @@ void __weak kvm_arch_pre_destroy_vm(struct kvm *kvm)
  * Cleanup should be automatic done in kvm_destroy_vm_debugfs() recursively, so
  * a per-arch destroy interface is not needed.
  */
-int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
+void __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
 {
-   return 0;
 }
 
 static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)

base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
-- 
2.44.0.rc0.258.g7320e95886-goog



Re: PowerNV PCIe hotplug support?

2024-01-11 Thread Oliver O'Halloran
On Thu, Dec 28, 2023 at 4:49 PM Timothy Pearson
 wrote:
>
> Hrmm, potentially related, I'm getting a kernel oops when I try to hot unplug 
> the entire PCIe switch from the PHB4:

IIRC there's hard to fix races in the pci core around removal of pci
slot devices. Maybe someone fixed those since I tried last, but I'd
just avoid doing that if I were you.


Re: PowerNV PCIe hotplug support?

2024-01-11 Thread Oliver O'Halloran
On Thu, Dec 28, 2023 at 3:16 PM Timothy Pearson
 wrote:
>
> I've been evaluating some new options for our POWER9-based hardware in the 
> NVMe space, and would like some clarification on the current status of PCIe 
> hotplug for the PowerNV platforms.
>
> From what I understand, the pnv_php driver provides the basic hotplug 
> functionality on PowerNV.  What I'm not clear on is to what extent this is 
> intended to flow downstream to attached PCIe switches.

I did a bunch of work on NVMe hotplug back in the day and it worked
fine then. Most of that work was done with Gen3 PLX switches though
which are considerably dumber than the newer ones though.

> I have a test setup here that consists of a PMC 8533 switch and several 
> downstream NVMe drives, with the switch attached directly to the PHB4 root 
> port.  After loading the pnv_php module, I can disconnect the downstream NVMe 
> devices by either using echo 0 on /sys/bus/pcu/slots/Snnn/power, or by 
> doing a physical surprise unplug, however nothing I try can induce a newly 
> plugged device to train and be detected on the bus.  Even trying a echo 0 and 
> subsequent echo 1 to /sys/bus/pcu/slots/Snnn/power only results in the 
> device going offline, there seems to be no way to bring the device back 
> online short of a reboot.
>
> Hotplug of other devices connected directly to the PHB4 appears to work 
> properly (I can online and offline via the power node); the issue seems to be 
> restricted to downstream devices connected to the (theoretically hotplug 
> capable) PMC 8533 switch.

I'd suspect either the PCIe interrupt is not being delivered for some
reason (EEH might be isolating the PCIe switch port?) or the removal
is triggering downstream port containment on the switch. Check the
port capability status with lspci. IIRC pnv_php doesn't know anything
about DPC so you might need to have skiboot disable that by default to
keep the kernel happy.

> Is this the intended behavior for downstream (non-IBM) PCIe ports?  Raptor 
> can provide resources to assist in a fix if needed, but I would like to 
> understand if this is a bug or an unimplemented feature first, and if the 
> latter what the main issues are likely to be in implementation.

It *should* work and the WARN()/BUG() spew you're seeing are bugs that
just need fixing. That said, hotplug on PNV is a headache for a bunch
of reasons most of which are due to EEH. Something I was working
towards before I left IBM was refactoring how EEH worked internally so
we could eliminate the need for pnv_php and go back to using the
standard pcieport driver. Unfortunately, that's a big job and I can't
even remember how much of that work actually made it upstream.


Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

2023-11-26 Thread Oliver Sang
hi, Linus,

On Sun, Nov 26, 2023 at 03:20:58PM -0800, Linus Torvalds wrote:
> On Sun, 26 Nov 2023 at 12:23, Linus Torvalds
>  wrote:
> >
> > IOW, I might have messed up some "trivial cleanup" when prepping for
> > sending it out...
> 
> Bah. Famous last words. One of the "trivial cleanups" made the code
> more "obvious" by renaming the nospec mask as just "mask".
> 
> And that trivial rename broke that patch *entirely*, because now that
> name shadowed the "fmode_t" mask argument.
> 
> Don't even ask how long it took me to go from "I *tested* this,
> dammit, now it doesn't work at all" to "Oh God, I'm so stupid".
> 
> So that nobody else would waste any time on this, attached is a new
> attempt. This time actually tested *after* the changes.

we applied the new patch upon 0ede61d858, and confirmed regression is gone,
even 3.4% better than 93faf426e3 now.

Tested-by: kernel test robot 

=
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  
gcc-12/performance/x86_64-rhel-8.3/thread/16/debian-11.1-x86_64-20220510.cgz/lkp-cpl-4sp2/poll2/will-it-scale

commit:
  93faf426e3 ("vfs: shave work on failed file open")
  0ede61d858 ("file: convert to SLAB_TYPESAFE_BY_RCU")
  c712b4365b ("Improve __fget_files_rcu() code generation (and thus 
__fget_light())")

93faf426e3cc000c 0ede61d8589cc2d93aa78230d74 c712b4365b5b4dbe1d1380edd37
 --- ---
 %stddev %change %stddev %change %stddev
 \  |\  |\
228481 ±  4%  -4.6% 217900 ±  6% -11.7% 201857 ±  5%  
meminfo.DirectMap4k
 89056-2.0%  87309-1.6%  87606
proc-vmstat.nr_slab_unreclaimable
 16.28-0.7%  16.16-1.0%  16.12
turbostat.RAMWatt
  0.01 ±  9%  +58125.6%   4.17 ±175%  +23253.5%   1.67 ±222%  
perf-sched.sch_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
781.67 ± 10%  +6.5% 832.50 ± 19% -14.3% 670.17 ±  4%  
perf-sched.wait_and_delay.count.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
 97958 ±  7%  -9.7%  88449 ±  4%  -0.6%  97399 ±  4%  
sched_debug.cpu.avg_idle.stddev
  0.00 ± 12% +24.2%   0.00 ± 17%  -5.2%   0.00 ±  7%  
sched_debug.cpu.next_balance.stddev
   6391048-2.9%6208584+3.4%6605584
will-it-scale.16.threads
399440-2.9% 388036+3.4% 412848
will-it-scale.per_thread_ops
   6391048-2.9%6208584+3.4%6605584
will-it-scale.workload
 19.99 ±  4%  -2.2   17.74+1.2   21.18 ±  2%  
perf-profile.calltrace.cycles-pp.fput.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
  1.27 ±  5%  +0.82.11 ±  3% +31.1   32.36 ±  2%  
perf-profile.calltrace.cycles-pp.__fdget.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
 32.69 ±  4%  +5.0   37.70   -32.70.00
perf-profile.calltrace.cycles-pp.__fget_light.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
  0.00   +27.9   27.85+0.00.00
perf-profile.calltrace.cycles-pp.__get_file_rcu.__fget_light.do_poll.do_sys_poll.__x64_sys_poll
 20.00 ±  4%  -2.3   17.75+0.4   20.43 ±  2%  
perf-profile.children.cycles-pp.fput
  0.24 ± 10%  -0.10.18 ±  2%  -0.10.18 ± 10%  
perf-profile.children.cycles-pp.syscall_return_via_sysret
  1.48 ±  5%  +0.51.98 ±  3% +30.8   32.32 ±  2%  
perf-profile.children.cycles-pp.__fdget
 31.85 ±  4%  +6.0   37.86   -31.80.00
perf-profile.children.cycles-pp.__fget_light
  0.00   +27.7   27.67+0.00.00
perf-profile.children.cycles-pp.__get_file_rcu
 30.90 ±  4% -20.6   10.35 ±  2% -30.90.00
perf-profile.self.cycles-pp.__fget_light
 19.94 ±  4%  -2.4   17.53-0.3   19.62 ±  2%  
perf-profile.self.cycles-pp.fput
  9.81 ±  4%  -2.47.42 ±  2%  +1.7   11.51 ±  4%  
perf-profile.self.cycles-pp.do_poll
  0.23 ± 11%  -0.10.17 ±  4%  -0.10.18 ± 11%  
perf-profile.self.cycles-pp.syscall_return_via_sysret
  0.44 ±  7%  +0.00.45 ±  5%  +0.10.52 ±  4%  
perf-profile.self.cycles-pp.__poll
  0.85 ±  4%  +0.10.92 ±  3% +30.3   31.17 ±  2%  
perf-profile.self.cycles-pp.__fdget
  0.00   +26.5   26.48+0.00.00
perf-profile.self.cycles-pp.__get_file_rcu
 2.146e+10 ±  2%  +8.5%  2.329e+10 ±  2%  -2.1%  2.101e+10
perf-stat.i.branch-instructions
   

Re: Questions: Should kernel panic when PCIe fatal error occurs?

2023-09-26 Thread Oliver O'Halloran
On Wed, Sep 27, 2023 at 9:03 AM Bjorn Helgaas  wrote:
>
> On Fri, Sep 22, 2023 at 10:46:36AM +0800, Shuai Xue wrote:
> > ...
>
> > Actually, this is a question from my colleague from firmware team.
> > The original question is that:
> >
> > "Should I set CPER_SEV_FATAL for Generic Error Status Block when a
> > PCIe fatal error is detected? If set, kernel will always panic.
> > Otherwise, kernel will always not panic."
> >
> > So I pull a question about desired behavior of Linux kernel first :)
> > From the perspective of the kernel, CPER_SEV_FATAL for Generic Error
> > Status Block is not reasonable. The kernel will attempt to recover
> > Fatal errors, although recovery may fail.
>
> I don't know the semantics of CPER_SEV_FATAL or why it's there.
> With CPER, we have *two* error severities: a "native" one defined by
> the PCIe spec and another defined by the platform via CPER.
>
> I speculate that the reason for the CPER severity could be to provide
> a severity for error sources that don't have a "native" severity like
> AER does, or for the vendor to force the OS to restart (for
> CPER_SEV_FATAL, anyway) in cases where it might not otherwise.
>
> In the native case, we only have the PCIe severity and don't have the
> CPER severity at all, and I suspect that unless there's uncontained
> data corruption, we would rather handle even the most severe PCIe
> fatal error by disabling the specific device(s) instead of panicking
> and restarting the whole machine.

>From a user's point of view disabling a device is often worse than a
reboot. If you get a fatal error from a system's only network card
then disabling the card may result in the system being uncontactable
until someone manually recovers it. Similarly, if the disk hosting the
root filesystem disappears the system may not crash immediately (most
of what it needs will be in page cache), but there's no guarantee that
it can do anything useful in that state. In both cases forcing a
reboot will probably bring the system back into a usable state.

> So for PCIe errors, I'm not sure setting CPER_SEV_FATAL is beneficial
> unless the platform wants to force the OS to panic, e.g., maybe the
> platform knows about data corruption and/or the vendor wants the OS to
> panic as part of a reliability story.

The PCIe spec is (intentionally?) vague about the causes of fatal
errors. For all we know a device is reporting one because the embedded
OS it was running crashed and as a side effect it's been DMAing junk
into system memory for the past hour. If you know something about the
device in question maybe you can make those assumptions, but in
general it's impossible to assess the actual severity of an error.
Even in the case of a noisy link causing bit flips (it's possible,
LCRC is only 16bit and ECEC is optional) if we get corruption of the
address bits of the TLP header then the DMA might have overwritten
something important to the OS. From a hardware vendor's point of view
just forcing a reboot makes a lot of sense.

Paranoia aside, in a lot of cases PCI device errors are nothing major
and can be resolved by just restarting the device. However, there's no
way for generic kernel code to make that assessment and we probably
shouldn't have the kernel guess. I'd say the safest option would be to
punt that decision to userspace and provide some way to whitelist
devices that we can ignore errors from. I'm not familiar enough with
the ACPI to know if we have enough details in the notification it
sends to actually implement that though.

Oliver


Re: Questions: Should kernel panic when PCIe fatal error occurs?

2023-09-24 Thread Oliver O'Halloran
On Fri, Sep 22, 2023 at 8:23 AM David Laight  wrote:
>
> > It would be nice if they worked the same, but I suspect that vendors
> > may rely on the fact that CPER_SEV_FATAL forces a restart/panic as
> > part of their system integrity story.
>
> The file system errors created by a panic (especially an NMI panic)
> could easily be more problematic than a failed PCIe data transfer.
> Evan a read that returned ~0u - which can be checked for.
>
> Panicking a system that is converting TDM telephony to RTP for the
> 911 emergency service because a PCIe cable/riser connecting one of the
> TDM board has become loose doesn't seem ideal.

For kernel native AER the default reaction to errors is
reset-and-reinit which probably isn't much better for your case.
Sounds like you would want a knob to suppress everything except error
reporting so you can handle it in userspace?

> (Or because the TDM board's fpga has decided it isn't going to respond
> to any accesses until the BARs are setup again...)
>
> The system can carry on with some TDM connections disabled - but that
> is ok because they are all duplicated in case a cable gets cuit.

Well that's a relief :)

Oliver


Re: [PATCH RESEND v2] KVM: move KVM_CAP_DEVICE_CTRL to the generic check

2023-06-14 Thread Oliver Upton
On Tue, Jun 13, 2023 at 02:16:16PM -0700, Sean Christopherson wrote:
> +
> 
> Please use scripts/get_maintainer.pl to generate the To/Cc lists.  This may be
> trivial, but it still needs eyeballs from the relevant maintainers.

+1000. I'd buy someone a beer if they made a bot that just ran
get_maintainer on patches that hit the list :)

> On Wed, Mar 15, 2023, Wei Wang wrote:
> > KVM_CAP_DEVICE_CTRL allows userspace to check if the kvm_device
> > framework (e.g. KVM_CREATE_DEVICE) is supported by KVM. Move
> > KVM_CAP_DEVICE_CTRL to the generic check for the two reasons:
> > 1) it already supports arch agnostic usages (i.e. KVM_DEV_TYPE_VFIO).
> > For example, userspace VFIO implementation may needs to create
> > KVM_DEV_TYPE_VFIO on x86, riscv, or arm etc. It is simpler to have it
> > checked at the generic code than at each arch's code.
> > 2) KVM_CREATE_DEVICE has been added to the generic code.
> > 
> > Link: 
> > https://lore.kernel.org/all/20221215115207.14784-1-wei.w.w...@intel.com
> > Signed-off-by: Wei Wang 
> > Reviewed-by: Sean Christopherson 
> > ---
> >  arch/arm64/kvm/arm.c   | 1 -
> >  arch/powerpc/kvm/powerpc.c | 1 -
> >  arch/riscv/kvm/vm.c| 1 -
> >  arch/s390/kvm/kvm-s390.c   | 1 -
> >  virt/kvm/kvm_main.c| 1 +
> >  5 files changed, 1 insertion(+), 4 deletions(-)
> > 
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 3bd732eaf087..96329e675771 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -202,7 +202,6 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
> > ext)
> > r = vgic_present;
> > break;
> > case KVM_CAP_IOEVENTFD:
> > -   case KVM_CAP_DEVICE_CTRL:
> > case KVM_CAP_USER_MEMORY:
> > case KVM_CAP_SYNC_MMU:
> > case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:

for arm64:

Reviewed-by: Oliver Upton 

-- 
Thanks,
Oliver


Re: [RFC 0/3] Asynchronous EEH recovery

2023-06-12 Thread Oliver O'Halloran
On Tue, Jun 13, 2023 at 11:44 AM Ganesh Goudar  wrote:
>
> Hi,
>
> EEH recovery is currently serialized and these patches shorten
> the time taken for EEH recovery by making the recovery to run
> in parallel. The original author of these patches is Sam Bobroff,
> I have rebased and tested these patches.
>
> On powervm with 64 VFs from same PHB,  I see approximately 48%
> reduction in time taken in EEH recovery.
>
> On powernv with 9 network cards, Where 2 cards installed on one
> PHB and 1 card on each of the rest of the PHBs, Providing 20 PFs
> in total. I see approximately 33% reduction in time taken in EEH
> recovery.
>
> These patches were originally posted as separate RFCs by Sam, And
> I rebased and posted these patches almost a year back, I stopped
> pursuing these patches as I was not able test this on powernv, Due
> to the issues in drivers of cards I was testing this on, Which are
> now resolved. Since I am re-posting this after long time, Posting
> this as a fresh RFC, Please comment.

What changes have you made since the last time you posted this series?
If the patches are the same then the comments I posted last time still
apply.


Re: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe

2023-05-31 Thread Oliver Upton
On Wed, May 31, 2023 at 05:10:52PM -0600, Yu Zhao wrote:
> On Wed, May 31, 2023 at 1:28 PM Oliver Upton  wrote:
> > On Tue, May 30, 2023 at 02:06:55PM -0600, Yu Zhao wrote:
> > > On Tue, May 30, 2023 at 1:37 PM Oliver Upton  
> > > wrote:
> > > > As it is currently implemented, yes. But, there's potential to fast-path
> > > > the implementation by checking page_count() before starting the walk.
> > >
> > > Do you mind posting another patch? I'd be happy to ack it, as well as
> > > the one you suggested above.
> >
> > I'd rather not take such a patch independent of the test_clear_young
> > series if you're OK with that. Do you mind implementing something
> > similar to the above patch w/ the proposed optimization if you need it?
> 
> No worries. I can take the above together with the following, which
> would form a new series with its own merits, since apparently you
> think the !AF case is important.

Sorry if my suggestion was unclear.

I thought we were talking about ->free_removed_table() being called from
the stage-2 unmap path, in which case we wind up unnecessarily visiting
PTEs on a table known to be empty. You could fast-path that by only
initiating a walk if  page_count() > 1:

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 95dae02ccc2e..766563dc465c 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1331,7 +1331,8 @@ void kvm_pgtable_stage2_free_removed(struct 
kvm_pgtable_mm_ops *mm_ops, void *pg
.end= kvm_granule_size(level),
};
 
-   WARN_ON(__kvm_pgtable_walk(, mm_ops, ptep, level + 1));
+   if (mm_ops->page_count(pgtable) > 1)
+   WARN_ON(__kvm_pgtable_walk(, mm_ops, ptep, level + 1));
 
WARN_ON(mm_ops->page_count(pgtable) != 1);
mm_ops->put_page(pgtable);


A lock-free access fault walker is interesting, but in my testing it hasn't
led to any significant improvements over acquiring the MMU lock for
read. Because of that I hadn't bothered with posting the series upstream.

-- 
Thanks,
Oliver


Re: [PATCH mm-unstable v2 05/10] kvm/arm64: add kvm_arch_test_clear_young()

2023-05-31 Thread Oliver Upton
Hi Yu,

On Fri, May 26, 2023 at 05:44:30PM -0600, Yu Zhao wrote:
> Implement kvm_arch_test_clear_young() to support the fast path in
> mmu_notifier_ops->test_clear_young().
> 
> It focuses on a simple case, i.e., hardware sets the accessed bit in
> KVM PTEs and VMs are not protected, where it can rely on RCU and
> cmpxchg to safely clear the accessed bit without taking
> kvm->mmu_lock. Complex cases fall back to the existing slow path
> where kvm->mmu_lock is then taken.
> 
> Signed-off-by: Yu Zhao 
> ---
>  arch/arm64/include/asm/kvm_host.h |  6 ++
>  arch/arm64/kvm/mmu.c  | 36 +++
>  2 files changed, 42 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index 7e7e19ef6993..da32b0890716 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -1113,4 +1113,10 @@ static inline void kvm_hyp_reserve(void) { }
>  void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
>  bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
>  
> +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> +static inline bool kvm_arch_has_test_clear_young(void)
> +{
> + return cpu_has_hw_af() && !is_protected_kvm_enabled();
> +}

I would *strongly* suggest you consider supporting test_clear_young on
systems that do software Access Flag management. FEAT_HAFDBS is an
*optional* extension to the architecture, so we're going to support
software AF management for a very long time in KVM. It is also a valid
fallback option in the case of hardware errata which render HAFDBS
broken.

So, we should expect (and support) systems of all shapes and sizes that
do software AF. I'm sure we'll hear about more in the not-too-distant
future...

For future reference (even though I'm suggesting you support software
AF), decisions such of these need an extremely verbose comment
describing the rationale behind the decision.

> +
>  #endif /* __ARM64_KVM_HOST_H__ */
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index c3b3e2afe26f..26a8d955b49c 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c

Please do not implement page table walkers outside of hyp/pgtable.c

> @@ -1678,6 +1678,42 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct 
> kvm_gfn_range *range)
>  range->start << PAGE_SHIFT);
>  }
>  
> +static int stage2_test_clear_young(const struct kvm_pgtable_visit_ctx *ctx,
> +enum kvm_pgtable_walk_flags flags)
> +{
> + kvm_pte_t new = ctx->old & ~KVM_PTE_LEAF_ATTR_LO_S2_AF;
> +
> + VM_WARN_ON_ONCE(!page_count(virt_to_page(ctx->ptep)));

This sort of sanity checking is a bit excessive. Isn't there a risk of
false negatives here too? IOW, if we tragically mess up RCU in the page
table code, what's stopping a prematurely freed page from being
allocated to another user?

> + if (!kvm_pte_valid(new))
> + return 0;
> +
> + if (new == ctx->old)
> + return 0;
> +
> + if (kvm_should_clear_young(ctx->arg, ctx->addr / PAGE_SIZE))
> + stage2_try_set_pte(ctx, new);
> +
> + return 0;
> +}
> +
> +bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
> +{
> + u64 start = range->start * PAGE_SIZE;
> + u64 end = range->end * PAGE_SIZE;
> + struct kvm_pgtable_walker walker = {
> + .cb = stage2_test_clear_young,
> + .arg= range,
> + .flags  = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_SHARED,
> + };
> +
> + BUILD_BUG_ON(is_hyp_code());

Delete this assertion.

> + kvm_pgtable_walk(kvm->arch.mmu.pgt, start, end - start, );
> +
> + return false;
> +}
> +
>  phys_addr_t kvm_mmu_get_httbr(void)
>  {
>   return __pa(hyp_pgtable->pgd);
> -- 
> 2.41.0.rc0.172.g3f132b7071-goog
> 

-- 
Thanks,
Oliver


Re: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe

2023-05-31 Thread Oliver Upton
On Tue, May 30, 2023 at 02:06:55PM -0600, Yu Zhao wrote:
> On Tue, May 30, 2023 at 1:37 PM Oliver Upton  wrote:
> >
> > Hi Yu,
> >
> > On Sat, May 27, 2023 at 02:13:07PM -0600, Yu Zhao wrote:
> > > On Sat, May 27, 2023 at 12:08 PM Oliver Upton  
> > > wrote:
> > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > index 3d61bd3e591d..bfbebdcb4ef0 100644
> > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > @@ -1019,7 +1019,7 @@ static int stage2_unmap_walker(const struct 
> > > > kvm_pgtable_visit_ctx *ctx,
> > > >
> > > > kvm_granule_size(ctx->level));
> > > >
> > > > if (childp)
> > > > -   mm_ops->put_page(childp);
> > > > +   mm_ops->free_removed_table(childp, ctx->level);
> > >
> > > Thanks, Oliver.
> > >
> > > A couple of things I haven't had the chance to verify -- I'm hoping
> > > you could help clarify:
> > > 1. For unmapping, with free_removed_table(), wouldn't we have to look
> > > into the table we know it's empty unnecessarily?
> >
> > As it is currently implemented, yes. But, there's potential to fast-path
> > the implementation by checking page_count() before starting the walk.
> 
> Do you mind posting another patch? I'd be happy to ack it, as well as
> the one you suggested above.

I'd rather not take such a patch independent of the test_clear_young
series if you're OK with that. Do you mind implementing something
similar to the above patch w/ the proposed optimization if you need it?

-- 
Thanks,
Oliver


Re: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe

2023-05-30 Thread Oliver Upton
Hi Yu,

On Sat, May 27, 2023 at 02:13:07PM -0600, Yu Zhao wrote:
> On Sat, May 27, 2023 at 12:08 PM Oliver Upton  wrote:
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 3d61bd3e591d..bfbebdcb4ef0 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -1019,7 +1019,7 @@ static int stage2_unmap_walker(const struct 
> > kvm_pgtable_visit_ctx *ctx,
> >
> > kvm_granule_size(ctx->level));
> >
> > if (childp)
> > -   mm_ops->put_page(childp);
> > +   mm_ops->free_removed_table(childp, ctx->level);
> 
> Thanks, Oliver.
> 
> A couple of things I haven't had the chance to verify -- I'm hoping
> you could help clarify:
> 1. For unmapping, with free_removed_table(), wouldn't we have to look
> into the table we know it's empty unnecessarily?

As it is currently implemented, yes. But, there's potential to fast-path
the implementation by checking page_count() before starting the walk.

> 2. For remapping and unmapping, how does free_removed_table() put the
> final refcnt on the table passed in? (Previously we had
> put_page(childp) in stage2_map_walk_table_post(). So I'm assuming we'd
> have to do something equivalent with free_removed_table().)

Heh, that's a bug, and an embarrassing one at that!

Sent out a fix for that, since it would appear we leak memory on
table->block transitions. PTAL if you have a chance.

https://lore.kernel.org/all/20230530193213.1663411-1-oliver.up...@linux.dev/

-- 
Thanks,
Oliver


Re: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe

2023-05-27 Thread Oliver Upton
Yu,

On Fri, May 26, 2023 at 05:44:29PM -0600, Yu Zhao wrote:
> Stage2 page tables are currently not RCU safe against unmapping or VM
> destruction. The previous mmu_notifier_ops members rely on
> kvm->mmu_lock to synchronize with those operations.
> 
> However, the new mmu_notifier_ops member test_clear_young() provides
> a fast path that does not take kvm->mmu_lock. To implement
> kvm_arch_test_clear_young() for that path, unmapped page tables need
> to be freed by RCU and kvm_free_stage2_pgd() needs to be after
> mmu_notifier_unregister().
> 
> Remapping, specifically stage2_free_removed_table(), is already RCU
> safe.
> 
> Signed-off-by: Yu Zhao 
> ---
>  arch/arm64/include/asm/kvm_pgtable.h |  2 ++
>  arch/arm64/kvm/arm.c |  1 +
>  arch/arm64/kvm/hyp/pgtable.c |  8 ++--
>  arch/arm64/kvm/mmu.c | 17 -
>  4 files changed, 25 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h 
> b/arch/arm64/include/asm/kvm_pgtable.h
> index ff520598b62c..5cab52e3a35f 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -153,6 +153,7 @@ static inline bool kvm_level_supports_block_mapping(u32 
> level)
>   * @put_page:Decrement the refcount on a page. When 
> the
>   *   refcount reaches 0 the page is automatically
>   *   freed.
> + * @put_page_rcu:RCU variant of the above.

You don't need to add yet another hook to implement this. I was working
on lock-free walks in a separate context and arrived at the following:

commit f82d264a37745e07ee28e116c336f139f681fd7f
Author: Oliver Upton 
Date:   Mon May 1 08:53:37 2023 +

KVM: arm64: Consistently use free_removed_table() for stage-2

free_removed_table() is essential to the RCU-protected parallel walking
scheme, as behind the scenes the cleanup is deferred until an RCU grace
period. Nonetheless, the stage-2 unmap path calls put_page() directly,
which leads to table memory being freed inline with the table walk.

This is safe for the time being, as the stage-2 unmap walker is called
while holding the write lock. A future change to KVM will further relax
the locking mechanics around the stage-2 page tables to allow lock-free
walkers protected only by RCU. As such, switch to the RCU-safe mechanism
for freeing table memory.

Signed-off-by: Oliver Upton 

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 3d61bd3e591d..bfbebdcb4ef0 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1019,7 +1019,7 @@ static int stage2_unmap_walker(const struct 
kvm_pgtable_visit_ctx *ctx,
   kvm_granule_size(ctx->level));
 
if (childp)
-   mm_ops->put_page(childp);
+   mm_ops->free_removed_table(childp, ctx->level);
 
return 0;
 }

-- 
Thanks,
Oliver


Re: [PATCH v2 0/4] KVM: Refactor KVM stats macros and enable custom stat names

2023-04-26 Thread Oliver Upton
On Mon, Mar 06, 2023 at 11:01:52AM -0800, David Matlack wrote:
> This series refactors the KVM stats macros to reduce duplication and
> adds the support for choosing custom names for stats.
> 
> Custom name makes it possible to decouple the userspace-visible stat
> names from their internal representation in C. This can allow future
> commits to refactor the various stats structs without impacting
> userspace tools that read KVM stats.
> 
> This also allows stats to be stored in data structures such as arrays,
> without needing unions to access specific stats. Case in point, the last
> patch in this series removes the pages_{4k,2m,1g} union, which is a
> useful cleanup to prepare for sharing paging code across architectures
> [1].
> 
> And for full transparency, another motivation for this series it that at
> Google we have several out-of-tree stats that use arrays. Custom name
> support is something we added internally and it reduces our technical
> debt to get the support merged upstream.

For the series:

Reviewed-by: Oliver Upton 

-- 
Thanks,
Oliver


Re: [PATCH mm-unstable v1 3/5] kvm/arm64: add kvm_arch_test_clear_young()

2023-02-17 Thread Oliver Upton
return false;

Same here...

> + /* see the comments on kvm_pgtable_walk_flags */
> + rcu_read_lock();
> +
> + kvm_pgtable_walk(kvm->arch.mmu.pgt, start, end - start, );
> +
> + rcu_read_unlock();

The rcu_read_{lock,unlock}() is entirely superfluous.

Of course, it is somewhat hidden by the fact that we must use
abstractions to support host and EL2 use of the page table code, but we
already make use of RCU to protect the stage-2 of a 'regular' VM.

> + return true;
> +}
> +
>  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>   if (!kvm->arch.mmu.pgt)
> @@ -1848,7 +1924,6 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
>  
>  void kvm_arch_flush_shadow_all(struct kvm *kvm)
>  {
> - kvm_free_stage2_pgd(>arch.mmu);
>  }

Doesn't this become a blatant correctness issue? This entirely fails to
uphold the exact expectations of the call.

-- 
Thanks,
Oliver


Re: [PATCH v8 3/3] ASoC: SOF: Fix deadlock when shutdown a frozen userspace

2022-12-01 Thread Oliver Neukum

On 01.12.22 14:03, Ricardo Ribalda wrote:

Hi,
 

This patchset does not modify this behaviour. It simply fixes the
stall for kexec().

The  patch that introduced the stall:
83bfc7e793b5 ("ASoC: SOF: core: unregister clients and machine drivers
in .shutdown")


That patch is problematic. I would go as far as saying that
it needs to be reverted.


was sent as a generalised version of:
https://github.com/thesofproject/linux/pull/3388

AFAIK, we would need a similar patch for every single board which
I am not sure it is doable in a reasonable timeframe.

On the meantime this seems like a decent compromises. Yes, a
miss-behaving userspace can still stall during suspend, but that was
not introduced in this patch.


Well, I mean if you know what wrong then I'd say at least return to
a sanely broken state.

The whole approach is wrong. You need to be able to deal with user
space talking to removed devices by returning an error and keeping
the resources association with the open file allocated until
user space calls close()

Regards
    Oliver





Re: [PATCH v8 3/3] ASoC: SOF: Fix deadlock when shutdown a frozen userspace

2022-12-01 Thread Oliver Neukum

On 01.12.22 12:08, Ricardo Ribalda wrote:

If we are shutting down due to kexec and the userspace is frozen, the
system will stall forever waiting for userspace to complete.

Do not wait for the clients to complete in that case.


Hi,

I am afraid I have to state that this approach is bad in every case,
not just this corner case. It basically means that user space can stall
the kernel for an arbitrary amount of time. And we cannot have that.

Regards
Oliver



Re: [RFC 0/3] Asynchronous EEH recovery

2022-08-17 Thread Oliver O'Halloran
On Tue, Aug 16, 2022 at 1:29 PM Ganesh Goudar  wrote:
>
> Hi,
>
> EEH reocvery is currently serialized and these patches shorten
> the time taken for EEH recovery by making the recovery to run
> in parallel. The original author of these patches is Sam Bobroff,
> I have rebased and tested these patches.
>
> On powervm with 64 VFs and I see approximately 48% reduction
> in time taken in EEH recovery,

Can you elaborate on how you're testing this? A VM with 64 separate
VFs on 64 separate PHBs I guess? How are you injecting errors?

> Yet to be tested on powernv.

If you're not testing on powernv you might as well not be testing. EEH
support on pseries is relatively straightforward since managing PCI
bridges, etc is all hidden away in the hypervisor. Most of the things
about EEH which give me headaches simply don't happen on pseries since
the PCI topology seen by the guest is flat.

> These patches were originally posted as separate RFCs, I think
> posting them as single series would be more helpful, I know the
> patches are too big, I will try to logically divide in next
> iterations.
>
> Thanks
>
> Ganesh Goudar (3):
>   powerpc/eeh: Synchronization for safety

I didn't pick that patch up after Sam left for a reason. I don't
recall the exact details, but I think I decided that the lock being
added had the same (or mostly the same) scope as the pci rescan lock.
All modifications to the EEH PE tree have to occur with the rescan
lock held since the per-device EEH setup occurs while configuring the
pci_dev (similarly the teardown happens when we remove the pci_dev
from its bus). I don't think that was true when Same wrote the patch
originally though since it pre-dates my big EEH init re-work.

It's possible (probable even!) I got that wrong, or that it was
something I was trying to make true through re-works rather than an
accurate assessment of the current state of things. I'll try page some
of this back into my brain later. I think I left some comments on
Sam's original RFC patch which might still be valid.

>   powerpc/eeh: Asynchronous recovery

I remember not hating the idea, but I think the implementation leaves
a lot to be desired. If you only really care about pseries then just
moving to a model where each PHB has its own recovery thread would get
you most of the benefit without needing to turn the already
complicated recovery path into an async trainwreck. IIRC Sam's
motivation for that patch was to make recovery faster for powernv
systems when an error was injected on a PF with a lot of child VFs.
Certain drivers took forever to recover each VF (might have been mlx5)
so doing it in parallel would have sped things up considerably.

Oliver


Re: [PATCH] MAINTAINERS: Remove myself as EEH maintainer

2022-08-11 Thread Oliver O'Halloran
On Thu, Aug 11, 2022 at 4:22 PM Michael Ellerman  wrote:
>
> Russell Currey  writes:
> > I haven't touched EEH in a long time I don't have much knowledge of the
> > subsystem at this point either, so it's misleading to have me as a
> > maintainer.
>
> Thank you for your service.
>
> > I remain grateful to Oliver for picking up my slack over the years.
>
> Ack.
>
> But I wonder if he is still happy being listed as the only maintainer.
> Given the status is "Supported" that means "Someone is actually paid to
> look after this" - and I suspect Oracle are probably not paying him to
> do that?

I'm still happy to field questions and/or give reviews occasionally if
needed, but yeah I don't have the time, hardware, or inclination to do
any actual maintenance. IIRC Mahesh was supposed to take over
supporting EEH after I left IBM. If he's still around he should
probably be listed as a maintainer.


Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains

2022-07-28 Thread Oliver O'Halloran
On Fri, Jul 29, 2022 at 12:21 PM Alexey Kardashevskiy  wrote:
>
> *snip*
>
> About this. If a platform has a concept of explicit DMA windows (2 or
> more), is it one domain with 2 windows or 2 domains with one window each?
>
> If it is 2 windows, iommu_domain_ops misses windows manipulation
> callbacks (I vaguely remember it being there for embedded PPC64 but
> cannot find it quickly).
>
> If it is 1 window per a domain, then can a device be attached to 2
> domains at least in theory (I suspect not)?
>
> On server POWER CPUs, each DMA window is backed by an independent IOMMU
> page table. (reminder) A window is a bus address range where devices are
> allowed to DMA to/from ;)

I've always thought of windows as being entries to a top-level "iommu
page table" for the device / domain. The fact each window is backed by
a separate IOMMU page table shouldn't really be relevant outside the
arch/platform.


Re: [PATCH] powerpc/powernv/pci: Drop VF MPS fixup

2022-05-19 Thread Oliver O'Halloran
On Thu, May 19, 2022 at 10:38 PM Michael Ellerman  wrote:
>
> Christophe Leroy  writes:
> > Le 02/09/2020 à 05:51, Oliver O'Halloran a écrit :
> >> The MPS field in the VF config space is marked as reserved in current
> >> versions of the SR-IOV spec. In other words, this fixup doesn't do
> >> anything.
> >>
> >> Signed-off-by: Oliver O'Halloran 
> >
> > A lot of cleanup patches from Oliver were merged in Septembre 2020 but
> > not this one.
> >
> > Any reason ?
>
> It wasn't clear to me that it's safe to remove. The commit that added it
> seemed to think it was important.
>
> The fact that it's out-of-spec doesn't mean we don't have some hardware
> somewhere that relies on that.

There is no hardware that depends on it. It was added in response to a
bug report on the IBM internal bugzilla about virtual functions not
reporting the same MPS as the physical function in the output of
lspci. This is by design since MPS is a property that is only relevant
to the PF. There was a corresponding patch to skiboot to intercept
writes to the MPS field of VFs which was used to fake a writable MPS
field in firmware. I removed that hack in 2019
(https://github.com/open-power/skiboot/commit/22057f868f3b2b1fd02647a738f6da0858b5eb6c)
since it was pointless and was causing other problems. There's no real
reason to keep this code around IMO.


Re: [PATCH] powerpc/eeh: Delay slot presence check once driver is notified about the pci error.

2021-11-24 Thread Oliver O'Halloran
On Wed, Nov 24, 2021 at 12:05 AM Mahesh Salgaonkar  wrote:
>
> *snip*
>
> This causes the EEH handler to get stuck for ~6
> seconds before it could notify that the pci error has been detected and
> stop any active operations. Hence with running I/O traffic, during this 6
> seconds, the network driver continues its operation and hits a timeout
> (netdev watchdog).On timeouts, network driver go into ffdc capture mode
> and reset path assuming the PCI device is in fatal condition. This causes
> EEH recovery to fail and sometimes it leads to system hang or crash.

Whatever is causing that crash is the real issue IMO. PCI error
reporting is fundamentally asynchronous and the driver always has to
tolerate some amount of latency between the error occuring and being
reported. Six seconds is admittedly an eternity, but it should not
cause a system crash under any circumstances. Printing a warning due
to a timeout is annoying, but it's not the end of the world.


Re: [PATCH] powerpc/eeh: Delay slot presence check once driver is notified about the pci error.

2021-11-24 Thread Oliver O'Halloran
On Wed, Nov 24, 2021 at 7:45 PM Mahesh J Salgaonkar
 wrote:
>
> No it doesn't. We will still do a presence check before the recovery
> process starts. This patch moves the check after notifying the driver to
> stop active I/O operations. If a presence check finds the device isn't
> present, we will skip the EEH recovery. However, on a surprise hotplug,
> the user will see the EEH messages on the console before it finds there
> is nothing to recover.

Suppressing the spurious EEH messages was part of why I added that
check in the first place. If you want to defer the presence check
until later you should move the stack trace printing, etc to after
we've confirmed there are still devices present. Considering the
motivation for this patch is to avoid spurious warnings from the
driver I don't think printing spurious EEH messages is much of an
improvement.

The other option would be returning an error from the pseries hotplug
driver. IIRC that's what pnv_php / OPAL does if the PHB is fenced and
we can't check the slot presence state.


[PATCH] pci: Rename pcibios_add_device to match

2021-09-13 Thread Oliver O'Halloran
The general convention for pcibios_* hooks is that they're named after
the corresponding pci_* function they provide a hook for. The exception
is pcibios_add_device() which provides a hook for pci_device_add(). This
has been irritating me for years so rename it.

Also, remove the export of the microblaze version. The only caller
must be compiled as a built-in so there's no reason for the export.

Signed-off-by: Oliver O'Halloran 
---
 arch/microblaze/pci/pci-common.c   | 3 +--
 arch/powerpc/kernel/pci-common.c   | 2 +-
 arch/powerpc/platforms/powernv/pci-sriov.c | 2 +-
 arch/s390/pci/pci.c| 2 +-
 arch/sparc/kernel/pci.c| 2 +-
 arch/x86/pci/common.c  | 2 +-
 drivers/pci/pci.c  | 4 ++--
 drivers/pci/probe.c| 4 ++--
 include/linux/pci.h| 2 +-
 9 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/arch/microblaze/pci/pci-common.c b/arch/microblaze/pci/pci-common.c
index 557585f1be41..622a4867f9e9 100644
--- a/arch/microblaze/pci/pci-common.c
+++ b/arch/microblaze/pci/pci-common.c
@@ -587,13 +587,12 @@ static void pcibios_fixup_resources(struct pci_dev *dev)
 }
 DECLARE_PCI_FIXUP_HEADER(PCI_ANY_ID, PCI_ANY_ID, pcibios_fixup_resources);
 
-int pcibios_add_device(struct pci_dev *dev)
+int pcibios_device_add(struct pci_dev *dev)
 {
dev->irq = of_irq_parse_and_map_pci(dev, 0, 0);
 
return 0;
 }
-EXPORT_SYMBOL(pcibios_add_device);
 
 /*
  * Reparent resource children of pr that conflict with res
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index c3573430919d..6749905932f4 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1059,7 +1059,7 @@ void pcibios_bus_add_device(struct pci_dev *dev)
ppc_md.pcibios_bus_add_device(dev);
 }
 
-int pcibios_add_device(struct pci_dev *dev)
+int pcibios_device_add(struct pci_dev *dev)
 {
struct irq_domain *d;
 
diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
b/arch/powerpc/platforms/powernv/pci-sriov.c
index 28aac933a439..486c2937b159 100644
--- a/arch/powerpc/platforms/powernv/pci-sriov.c
+++ b/arch/powerpc/platforms/powernv/pci-sriov.c
@@ -54,7 +54,7 @@
  * to "new_size", calculated above. Implementing this is a convoluted process
  * which requires several hooks in the PCI core:
  *
- * 1. In pcibios_add_device() we call pnv_pci_ioda_fixup_iov().
+ * 1. In pcibios_device_add() we call pnv_pci_ioda_fixup_iov().
  *
  *At this point the device has been probed and the device's BARs are sized,
  *but no resource allocations have been done. The SR-IOV BARs are sized
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index e7e6788d75a8..ded3321b7208 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -561,7 +561,7 @@ static void zpci_cleanup_bus_resources(struct zpci_dev 
*zdev)
zdev->has_resources = 0;
 }
 
-int pcibios_add_device(struct pci_dev *pdev)
+int pcibios_device_add(struct pci_dev *pdev)
 {
struct zpci_dev *zdev = to_zpci(pdev);
struct resource *res;
diff --git a/arch/sparc/kernel/pci.c b/arch/sparc/kernel/pci.c
index 9c2b720bfd20..31b0c1983286 100644
--- a/arch/sparc/kernel/pci.c
+++ b/arch/sparc/kernel/pci.c
@@ -1010,7 +1010,7 @@ void pcibios_set_master(struct pci_dev *dev)
 }
 
 #ifdef CONFIG_PCI_IOV
-int pcibios_add_device(struct pci_dev *dev)
+int pcibios_device_add(struct pci_dev *dev)
 {
struct pci_dev *pdev;
 
diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index 3507f456fcd0..9e1e6b8d8876 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -632,7 +632,7 @@ static void set_dev_domain_options(struct pci_dev *pdev)
pdev->hotplug_user_indicators = 1;
 }
 
-int pcibios_add_device(struct pci_dev *dev)
+int pcibios_device_add(struct pci_dev *dev)
 {
struct pci_setup_rom *rom;
struct irq_domain *msidom;
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index ce2ab62b64cf..c63598c1cdd8 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2091,14 +2091,14 @@ void pcim_pin_device(struct pci_dev *pdev)
 EXPORT_SYMBOL(pcim_pin_device);
 
 /*
- * pcibios_add_device - provide arch specific hooks when adding device dev
+ * pcibios_device_add - provide arch specific hooks when adding device dev
  * @dev: the PCI device being added
  *
  * Permits the platform to provide architecture specific functionality when
  * devices are added. This is the default implementation. Architecture
  * implementations can override this.
  */
-int __weak pcibios_add_device(struct pci_dev *dev)
+int __weak pcibios_device_add(struct pci_dev *dev)
 {
return 0;
 }
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index d9fc02a71baa..2ba43b6adf31 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2450,7 +2450,7 @@ static struct irq_domain *pci_dev_msi_domain(s

Re: [PATCH 1/1] powerpc: Drop superfluous pci_dev_is_added() calls

2021-09-13 Thread Oliver O'Halloran
On Sat, Sep 11, 2021 at 9:09 PM Michael Ellerman  wrote:
>
> Niklas Schnelle  writes:
> > On powerpc, pci_dev_is_added() is called as part of SR-IOV fixups
> > that are done under pcibios_add_device() which in turn is only called in
> > pci_device_add() whih is called when a PCI device is scanned.
>
> Thanks for cleaning this up for us.
>
> > Now pci_dev_assign_added() is called in pci_bus_add_device() which is
> > only called after scanning the device. Thus pci_dev_is_added() is always
> > false and can be dropped.
>
> My only query is whether we can pin down when that changed.
>
> Oliver said:
>
>   The use of pci_dev_is_added() in arch/powerpc was because in the past
>   pci_bus_add_device() could be called before pci_device_add(). That was
>   fixed a while ago so It should be safe to remove those calls now.
>
> I trawled back through the history a bit but I can't remember/find which
> commit changed that, Oliver can you remember?

Yeah, on closer inspection that never happened. The re-ordering I was
thinking of was when the boot-time BAR assignments were moved in
3f068aae7a95 so they'd always occur before pci_bus_add_device() was
called. I think I got that change mixed up with commit 30d87ef8b38d
("powerpc/pci: Fix pcibios_setup_device() ordering") which moved some
of what what pcibios_device_add() did into pcibios_bus_add_device() to
harmonise the hot and coldplug paths.

As far as I can tell the pci_dev_is_added() check has been pointless
since the code was added in 6e628c7d33d9 ("powerpc/powernv: Reserve
additional space for IOV BAR according to the number of total_pe").
Even back then pci_device_add() was called first in both the normal
and OF based PCI probing paths so there's no circumstance where that
code would see the added flag set.

That patch was part of the PowerNV SRIOV support series which went
through quite a few iterations. My best guess is that check might have
been needed in an earlier version and was just carried forward until
it got merged. I didn't dig too deeply into the history though.

Reviewed-by: Oliver O'Halloran 


Re: [PATCH 0/5] s390/pci: automatic error recovery

2021-09-07 Thread Oliver O'Halloran
On Tue, Sep 7, 2021 at 10:21 PM Niklas Schnelle  wrote:
>
> On Tue, 2021-09-07 at 10:45 +0200, Niklas Schnelle wrote:
> > On Tue, 2021-09-07 at 12:04 +1000, Oliver O'Halloran wrote:
> > > On Mon, Sep 6, 2021 at 7:49 PM Niklas Schnelle  
> > > wrote:
> > > > Patch 3 I already sent separately resulting in the discussion below but 
> > > > without
> > > > a final conclusion.
> > > >
> > > > https://lore.kernel.org/lkml/20210720150145.640727-1-schne...@linux.ibm.com/
> > > >
> > > > I believe even though there were some doubts about the use of
> > > > pci_dev_is_added() by arch code the existing uses as well as the use in 
> > > > the
> > > > final patch of this series warrant this export.
> > >
> > > The use of pci_dev_is_added() in arch/powerpc was because in the past
> > > pci_bus_add_device() could be called before pci_device_add(). That was
> > > fixed a while ago so It should be safe to remove those calls now.
> >
> > Hmm, ok that confirms Bjorns suspicion and explains how it came to be.
> > I can certainly sent a patch for that. This would then leave only the
> > existing use in s390 which I added because of a dead lock prevention
> > and explained here:
> > https://lore.kernel.org/lkml/87d15d5eead35c9eaa667958d057cf4a81a8bf13.ca...@linux.ibm.com/
> >
> > Plus the need to use it in the recovery code of this series. I think in
> > the EEH code the need for a similar check is alleviated by the checks
> > in the beginning of
> > arch/powerpc/kernel/eeh_driver.c:eeh_handle_normal_event() especially
> > eeh_slot_presence_check() which checks presence via the hotplug slot.
> > I guess we could use our own state tracking in a similar way but felt
> > like pci_dev_is_added() is the more logical choice.

The slot check is mainly there to prevent attempts to "recover"
devices that have been surprise removed (i.e NVMe hot-unplug). The
actual recovery process operates off the eeh_pe tree which is frozen
in place when an error is detected. If a pci_dev is added or removed
it's not really a problem since those are only ever looked at when
notifying drivers which is done with the rescan_remove lock held. That
said, I wouldn't really encourage anyone to follow the EEH model since
it's pretty byzantine.

> Looking into this again, I think we actually can't easily track this
> state ourselves outside struct pci_dev. The reason for this is that
> when e.g. arch/s390/pci/pci_sysfs.c:recover_store() removes the struct
> pci_dev and scans it again the new struct pci_dev re-uses the same
> struct zpci_dev because from a platform point of view the PCI device
> was never removed but only disabled and re-enabled. Thus we can only
> distinguish the stale struct pci_dev by looking at things stored in
> struct pci_dev itself.

IMO the real problem is removing and re-adding the pci_dev. I think
it's something that's done largely because the PCI core doesn't really
provide any better mechanism for getting a device back into a
known-good state so it's abused to implement error recovery. This is
something that's always annoyed me since it conflates recovery with
hotplug. After a hot-(un)plug we might have a different device or no
device. In the recovery case we expect to start and end with the same
device. Why not apply the same logic to the pci_dev?

Something I was tinkering with before I left IBM was re-working the
way EEH handles recovering devices that don't have a driver with error
handling callbacks to something like:

1. unbind the driver
2. pci_save_state()
3. do the reset
4. pci_restore_state()
5. re-bind the driver

That would allow keeping the pci_dev around and let me delete a pile
of confusing code which handles binding the eeh_dev to the new
pci_dev. The obvious problem with that approach is the assumption the
device is functional enough to allow saving the config space, but I
don't think that's a deal breaker. We could stash a copy of the device
state before we allow drivers to attach and use that to restore the
device after the reset. The end result would be the same known-good
state that we'd get after a re-scan.

> That said, I think for the recovery case we might be able to drop the
> pci_dev_is_added() and rely on pdev->driver != NULL which we check
> anyway and that should catch any PCI device that was already removed.

Would that work if there was an error on a device without a driver
bound? If you're just trying to stop races between recovery and device
removal then pci_dev_is_added() is probably the right tool for the
job. Trying to substitute it with a proxy seems like a bad idea.


Re: [PATCH] pci/hotplug/pnv-php: Remove probable double put

2021-09-07 Thread Oliver O'Halloran
On Wed, Sep 8, 2021 at 8:02 AM Tyrel Datwyler  wrote:
>
> On 9/7/21 1:59 AM, Xu Wang wrote:
> > Device node iterators put the previous value of the index variable,
> > so an explicit put causes a double put.
> >
> > Signed-off-by: Xu Wang 
> > ---
> >  drivers/pci/hotplug/pnv_php.c | 1 -
> >  1 file changed, 1 deletion(-)
> >
> > diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
> > index 04565162a449..ed4d1a2c3f22 100644
> > --- a/drivers/pci/hotplug/pnv_php.c
> > +++ b/drivers/pci/hotplug/pnv_php.c
> > @@ -158,7 +158,6 @@ static void pnv_php_detach_device_nodes(struct 
> > device_node *parent)
> >   for_each_child_of_node(parent, dn) {
> >   pnv_php_detach_device_nodes(dn);
> >
> > - of_node_put(dn);
> >   of_detach_node(dn);
>
> Are you sure this is a double put? This looks to me like its meant to drive 
> tear
> down of the device by putting a long term reference and not the short term get
> that is part of the iterator.

Yeah, the put is there is to drop the initial ref so the node can be
released. It might be worth adding a comment.


Re: [PATCH 0/5] s390/pci: automatic error recovery

2021-09-06 Thread Oliver O'Halloran
On Mon, Sep 6, 2021 at 7:49 PM Niklas Schnelle  wrote:
>
> Patch 3 I already sent separately resulting in the discussion below but 
> without
> a final conclusion.
>
> https://lore.kernel.org/lkml/20210720150145.640727-1-schne...@linux.ibm.com/
>
> I believe even though there were some doubts about the use of
> pci_dev_is_added() by arch code the existing uses as well as the use in the
> final patch of this series warrant this export.

The use of pci_dev_is_added() in arch/powerpc was because in the past
pci_bus_add_device() could be called before pci_device_add(). That was
fixed a while ago so It should be safe to remove those calls now.

> Patch 4 "PCI: Export pci_dev_lock()" is basically an extension to commit
> e3a9b1212b9d ("PCI: Export pci_dev_trylock() and pci_dev_unlock()") which
> already exported pci_dev_trylock(). In the final patch we make use of
> pci_dev_lock() to wait for any other exclusive uses of the pdev to be finished
> before starting recovery.

Hmm, I noticed the EEH
(arch/powerpc/kernel/eeh_driver.c:eeh_pe_report_edev())  and the
generic PCIe error recovery code (see
drivers/pci/pcie/err.c:report_error_detected()) only call
device_lock() before entering the driver's error handling callbacks. I
wonder if they should be using pci_dev_lock() instead. The only real
difference is that pci_dev_lock() will also block user space from
accessing the device's config space while error recovery is in
progress which seems sensible enough.


Re: [PATCH] powerpc/chrp: Revert "Move PHB discovery" and "Make hydra_init() static"

2021-07-18 Thread Oliver O'Halloran
On Sun, Jul 18, 2021 at 2:14 AM Guenter Roeck  wrote:
>
> On Sun, Jul 18, 2021 at 01:54:23AM +1000, Oliver O'Halloran wrote:
> > On Sat, Jul 17, 2021 at 8:12 AM Guenter Roeck  wrote:
> > >
> > > This patch reverts commit 407d418f2fd4 ("powerpc/chrp: Move PHB
> > > discovery") and commit 9634afa67bfd ("powerpc/chrp: Make hydra_init()
> > > static").
> > >
> > > Running the upstream kernel on Qemu's brand new "pegasos2" emulation
> > > results in a variety of backtraces such as
> >
> > ...and actually using it appears to require both manually enabling it
> > in the qemu config and finding a random bios blob that is no longer
> > distributed by the manufacturer. Cool.
> >
> That is absolutely wrong. vof.bin provided by qemu in the linux root
> directory works just fine, plus chrp32_defconfig minus SMP.

It looks like I forgot to actually merge the current HEAD into my
local qemu branch which was a few weeks old. VOF support was merged on
the 13th so I didn't see it. My apologies.

Oliver


Re: [PATCH] powerpc/chrp: Revert "Move PHB discovery" and "Make hydra_init() static"

2021-07-18 Thread Oliver O'Halloran
On Sun, Jul 18, 2021 at 2:24 AM Guenter Roeck  wrote:
>
> On Sat, Jul 17, 2021 at 05:57:50PM +0200, Christophe Leroy wrote:
> > Guenter Roeck  a écrit :
> >
> > > This patch reverts commit 407d418f2fd4 ("powerpc/chrp: Move PHB
> > > discovery") and commit 9634afa67bfd ("powerpc/chrp: Make hydra_init()
> > > static").
> > >
> > > Running the upstream kernel on Qemu's brand new "pegasos2" emulation
> > > results in a variety of backtraces such as
> > >
> > > Kernel attempted to write user page (a1) - exploit attempt? (uid: 0)
> > > [ cut here ]
> > > Bug: Write fault blocked by KUAP!
> > > WARNING: CPU: 0 PID: 0 at arch/powerpc/mm/fault.c:230
> > > do_page_fault+0x4f4/0x920
> > > CPU: 0 PID: 0 Comm: swapper Not tainted 5.13.2 #40
> > > NIP:  c0021824 LR: c0021824 CTR: 
> > > REGS: c1085d50 TRAP: 0700   Not tainted  (5.13.2)
> > > MSR:  00021032   CR: 24042254  XER: 
> > >
> > > GPR00: c0021824 c1085e10 c0f8c520 0021 3fffefff c1085c60 c1085c58
> > > 
> > > GPR08: 1032   c0ffb3ec 44042254  
> > > 0004
> > > GPR16:   00c4 00d0 0188c6e0 01006000 0001
> > > 40b14000
> > > GPR24: c0ec000c 0300 0200  4200 00a1 
> > > c1085e60
> > > NIP [c0021824] do_page_fault+0x4f4/0x920
> > > LR [c0021824] do_page_fault+0x4f4/0x920
> > > Call Trace:
> > > [c1085e10] [c0021824] do_page_fault+0x4f4/0x920 (unreliable)
> > > [c1085e50] [c0004254] DataAccess_virt+0xd4/0xe4
> > >
> > > and the system fails to boot. Bisect points to commit 407d418f2fd4
> > > ("powerpc/chrp: Move PHB discovery"). Reverting this patch together with
> > > commit 9634afa67bfd ("powerpc/chrp: Make hydra_init() static") fixes
> > > the problem.
> >
> > Isn't there more than that in the backtrace ? If there is a fault blocked by
> > Kuap, it means there is a fault. It should be visible in the traces.
> >
> > Should we fix the problem instead of reverting the commit that made the
> > problem visible ?
> >
>
> I do not think the patch reverted here made the problem visible. I am
> quite sure that it introduced it. AFAIS the problem is that the new code
> initializes and remaps PCI much later, after it is being used.

Right. The bug is that on 32bit platforms the PHB setup also maps one
of the PHB's IO space as "ISA IO space" as a side effect. There's a
handful of platforms (pegasos2 is one) which use an i8259 interrupt
controller and configuring that requires access to IO / ISA space. The
KUAP faults we're setting are because isa_io_base is still set to zero
so outb() and friends are accessing the zero page.

I don't think there's any real reason why we need to have PCI fully
set up to handle that situation. A few platforms already have early
fixup code which parses the DT directly rather than using the fields
of pci_controller (which are parsed from the DT anyway) and I'm pretty
sure we can do something similar.

> Also, the
> patch introducing the problem was never tested on real hardware (it even
> says so in the patch comments). That by itself seems to be quite
> problematic for such an invasive patch, and makes me wonder if some of
> the other PHB discovery related patches introduced similar problems.

The legacy platforms are maintained on a best-effort basis. Ellerman's
CI farm covers most of the powerpc CPU types, but there's no real way
to test the bulk of the platforms in the tree since most of the
hardware is currently in landfill.

> Anyway, I do not use or have that hardware. I was just playing with the
> latest version of qemu and ended up tracking down why its brand new
> pegasos2 emulation no longer boots with the latest Linux kernel.
> I personally don't care too much if ppc/chrp support is broken in the
> upstream kernel or not. Please take this patch as problem report,
> and feel free to do with it whatever you like, including ignoring it.

Problem reports are fine and appreciated. I'd be less cranky if you
included the kernel config you used in the initial report since I
wasted an hour of my saturday trying to replicate it with various
kernel configs that had SMP enabled since that's what the
chrp_defconfig uses.

Oliver


Re: [PATCH] powerpc/chrp: Revert "Move PHB discovery" and "Make hydra_init() static"

2021-07-17 Thread Oliver O'Halloran
On Sat, Jul 17, 2021 at 8:12 AM Guenter Roeck  wrote:
>
> This patch reverts commit 407d418f2fd4 ("powerpc/chrp: Move PHB
> discovery") and commit 9634afa67bfd ("powerpc/chrp: Make hydra_init()
> static").
>
> Running the upstream kernel on Qemu's brand new "pegasos2" emulation
> results in a variety of backtraces such as

...and actually using it appears to require both manually enabling it
in the qemu config and finding a random bios blob that is no longer
distributed by the manufacturer. Cool.

> Kernel attempted to write user page (a1) - exploit attempt? (uid: 0)
> [ cut here ]
> Bug: Write fault blocked by KUAP!
> WARNING: CPU: 0 PID: 0 at arch/powerpc/mm/fault.c:230 
> do_page_fault+0x4f4/0x920
> CPU: 0 PID: 0 Comm: swapper Not tainted 5.13.2 #40
> NIP:  c0021824 LR: c0021824 CTR: 
> REGS: c1085d50 TRAP: 0700   Not tainted  (5.13.2)
> MSR:  00021032   CR: 24042254  XER: 
>
> GPR00: c0021824 c1085e10 c0f8c520 0021 3fffefff c1085c60 c1085c58 
> GPR08: 1032   c0ffb3ec 44042254   0004
> GPR16:   00c4 00d0 0188c6e0 01006000 0001 40b14000
> GPR24: c0ec000c 0300 0200  4200 00a1  c1085e60
> NIP [c0021824] do_page_fault+0x4f4/0x920
> LR [c0021824] do_page_fault+0x4f4/0x920
> Call Trace:
> [c1085e10] [c0021824] do_page_fault+0x4f4/0x920 (unreliable)
> [c1085e50] [c0004254] DataAccess_virt+0xd4/0xe4
>
> and the system fails to boot. Bisect points to commit 407d418f2fd4
> ("powerpc/chrp: Move PHB discovery"). Reverting this patch together with
> commit 9634afa67bfd ("powerpc/chrp: Make hydra_init() static") fixes
> the problem.

The rationale for adding ppc_md.discover_phbs() and shifting all the
platforms over to using it is in commit 5537fcb319d0 ("powerpc/pci:
Add ppc_md.discover_phbs()"). I'd rather not go back to having random
platforms doing their PCI init before the kernel has setup the page
allocator. You need to either debug the problem fully, or provide
enough replication details so that someone who isn't invested in
emulating ancient hardware (i.e. me) with enough information to
actually replicate the problem.

Oliver


Re: [PATCH] Documentation: PCI: pci-error-recovery: rearrange the general sequence

2021-06-18 Thread Oliver O'Halloran
On Fri, Jun 18, 2021 at 4:05 PM Wesley Sheng  wrote:
>
> Reset_link() callback function was called before mmio_enabled() in
> pcie_do_recovery() function actually, so rearrange the general
> sequence betwen step 2 and step 3 accordingly.

I don't think this is true in all cases. If pcie_do_recovery() is
called with state==pci_channel_io_normal (i.e. non-fatal AER) the link
won't be reset. EEH (ppc PCI error recovery thing) also uses
.mmio_enabled() as described.


Re: [PATCH] powerpc: Fix kernel-jump address for ppc64 wrapper boot

2021-06-08 Thread Oliver O'Halloran
On Tue, Jun 8, 2021 at 4:33 PM He Ying  wrote:
>
> Hello,
>
> 在 2021/6/8 13:26, Oliver O'Halloran 写道:
> > On Fri, Jun 4, 2021 at 7:39 PM He Ying  wrote:
> >>  From "64-bit PowerPC ELF Application Binary Interface Supplement 1.9",
> >> we know that the value of a function pointer in a language like C is
> >> the address of the function descriptor and the first doubleword
> >> of the function descriptor contains the address of the entry point
> >> of the function.
> >>
> >> So, when we want to jump to an address (e.g. addr) to execute for
> >> PPC-elf64abi, we should assign the address of addr *NOT* addr itself
> >> to the function pointer or system will jump to the wrong address.
> > How have you tested this?
>
> I tested ppc64-elf big-endian. I changed the Kconfig so that ppc64
> big-endian
>
> selects PPC64_WRAPPER_BOOT. I used qemu to run the cuImage and found
>
> the problem. It made me confused. By applying this patch, I found it works.
>
> I thought it works for ppc64le too. So I upstream this patch.
>
> >
> > IIRC the 64bit wrapper is only used for ppc64le builds. For that case
> > the current code is work because the LE ABI (ABIv2) doesn't use
> > function descriptors. I think even for a BE kernel we need the current
> > behaviour because the vmlinux's entry point is screwed up (i.e.
> > doesn't point a descriptor) and tools in the wild (probably kexec)
> > expect it to be screwed up.
>
> Yes, you're right. PPC64_WRAPPER_BOOT is only used for ppc64le builds
> currently.
>
> LE ABI (ABI v2) doesn't use function descriptors. Is that right? I don't
> test that. If so,
>
> this patch should be dropped. But why does ppc64 have different ABIs? So
> strange.

Yeah, it is strange. When LE support was added the toolchain team took
the opportunity to revamp the ABI since BE and LE binaries were never
going to be compatible. IIRC there is a slight performance advantage
to using v2 since function descriptors added an extra load when
performing a non-local function call. I think.

> If the wrapper is built to ppc64be, my patch is tested right. The entry
> point in the ELF
>
> header is always right so you can assign the header->e_entry to the
> function pointer
>
> and then jump to the entry by calling the function. But in the ppc
> wrapper, the address
>
> is intialized to 0 or malloced to be an address later. In this
> situation, I think my patch
>
> should be right for ppc64be.

Yeah maybe it's fine. I just have some memories of running into some
bizzare edge case at some point. It might have been the entrypoint of
the zImage rather than the vmlinux which had (has?) that problem.


Re: [PATCH] powerpc: Fix kernel-jump address for ppc64 wrapper boot

2021-06-07 Thread Oliver O'Halloran
On Fri, Jun 4, 2021 at 7:39 PM He Ying  wrote:
>
> From "64-bit PowerPC ELF Application Binary Interface Supplement 1.9",
> we know that the value of a function pointer in a language like C is
> the address of the function descriptor and the first doubleword
> of the function descriptor contains the address of the entry point
> of the function.
>
> So, when we want to jump to an address (e.g. addr) to execute for
> PPC-elf64abi, we should assign the address of addr *NOT* addr itself
> to the function pointer or system will jump to the wrong address.

How have you tested this?

IIRC the 64bit wrapper is only used for ppc64le builds. For that case
the current code is work because the LE ABI (ABIv2) doesn't use
function descriptors. I think even for a BE kernel we need the current
behaviour because the vmlinux's entry point is screwed up (i.e.
doesn't point a descriptor) and tools in the wild (probably kexec)
expect it to be screwed up.

ABIv2 (LE) reference:
https://openpowerfoundation.org/?resource_lib=64-bit-elf-v2-abi-specification-power-architecture


Re: [PATCH] powerpc/eeh: skip slot presence check when PE is temporarily unavailable.

2021-05-06 Thread Oliver O'Halloran
On Fri, May 7, 2021 at 3:43 AM Mahesh Salgaonkar  wrote:
>
> When certain PHB HW failure causes phyp to recover PHB, it marks the PE
> state as temporarily unavailable. In this case, per PAPR, rtas call
> ibm,read-slot-reset-state2 returns a PE state as temporarily unavailable(5)
> and OS has to wait until that recovery is complete. During this state the
> slot presence check 'get-sensor-state(dr-entity-sense)' returns as DR
> connector empty which leads to assumption that the device has been
> hot-removed. This results into no EEH recovery on this device and it stays
> in failed state forever.
>
> This patch fixes this issue by skipping slot presence check only if device
> PE state is temporarily unavailable(5).
>
> Signed-off-by: Mahesh Salgaonkar 
> ---
> * snip*
>
> /*
>  * It should be corner case that the parent PE has been
> diff --git a/arch/powerpc/kernel/eeh_driver.c 
> b/arch/powerpc/kernel/eeh_driver.c
> index 3eff6a4888e79..a0913768f33de 100644
> --- a/arch/powerpc/kernel/eeh_driver.c
> +++ b/arch/powerpc/kernel/eeh_driver.c
> @@ -851,6 +851,17 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
> return;
> }
>
> +   /*
> +* When PE's state is temporarily unavailable, the slot
> +* presence check returns as DR connector empty.

That sounds like a bug in either RTAS or the hotplug slot driver (or
both). The presence check is there largely to filter out events that
we can guarantee are not recoverable (i.e. surprise hot-unplug). In
every other case (especially if we can't determine the state) we
should be going down the recovery path. If the hotplug slot driver is
incorrectly reporting the card has been removed then you should be
fixing the slot driver.

> +* to assumption that the device is hot-removed and causes EEH
> +* recovery to stop leaving the device in failed state forever.
> +* Hence skip the slot presence check if PE's state is
> +* temporarily unavailable and go down EEH recovery path.
> +*/
> +   if (pe->state & EEH_PE_TEMP_UNAVAIL)
> +   goto skip_slot_presence_check;

There's a time-of-check-vs-time-of-use error here. You're setting this
flag at the point of detection, but there can be a significant lag
time between when an EEH is initially detected and when it's handled
by the recovery thread (usually due to other events being recovered).
Transitioning the PE into and out of PE_TEMP_UNAVAILABLE is handled
autonomously by the hypervisor so by the time we get to recovery the
PE may be back into a normal state where the slot check works fine.


Re: [PATCH] powerpc/powernv/pci: remove dead code from !CONFIG_EEH

2021-04-22 Thread Oliver O'Halloran
On Fri, Apr 23, 2021 at 9:09 AM Daniel Axtens  wrote:
>
> Hi Nick,
>
> > While looking at -Wundef warnings, the #if CONFIG_EEH stood out as a
> > possible candidate to convert to #ifdef CONFIG_EEH, but it seems that
> > based on Kconfig dependencies it's not possible to build this file
> > without CONFIG_EEH enabled.
>
> This seemed odd to me, but I think you're right:
>
> arch/powerpc/platforms/Kconfig contains:
>
> config EEH
> bool
> depends on (PPC_POWERNV || PPC_PSERIES) && PCI
> default y
>
> It's not configurable from e.g. make menuconfig because there's no prompt.
> You can attempt to explicitly disable it with e.g. `scripts/config -d EEH`
> but then something like `make oldconfig` will silently re-enable it for
> you.
>
> It's been forced on since commit e49f7a9997c6 ("powerpc/pseries: Rivet
> CONFIG_EEH for pSeries platform") in 2012 which fixed it for
> pseries. That moved out from pseries to pseries + powernv later on.
>
> There are other cleanups in the same vein that could be made, from the
> Makefile (which has files only built with CONFIG_EEH) through to other
> source files. It looks like there's one `#ifdef CONFIG_EEH` in
> arch/powerpc/platforms/powernv/pci-ioda.c that could be pulled out, for
> example.
>
> I think it's probably worth trying to rip out all of those in one patch?

The change in commit e49f7a9997c6 ("powerpc/pseries: Rivet CONFIG_EEH
for pSeries platform") never should have been made.

There's no inherent reason why EEH needs to be enabled and forcing it
on is (IMO) a large part of why EEH support is the byzantine
clusterfuck that it is. One of the things I was working towards was
allowing pseries and powernv to be built with !CONFIG_EEH since that
would help define a clearer boundary between what is "eeh support" and
what is required to support PCI on the platform. Pseries is
particularly bad for this since PAPR says the RTAS calls needed to do
a PCI bus reset are part of the EEH extension, but there's non-EEH
reasons why you might want to use those RTAS calls. The PHB reset that
we do when entering a kdump kernel is a good example since that uses
the same RTAS calls, but it has nothing to do with the EEH recovery
machinery enabled by CONFIG_EEH.

I was looking into that largely because people were considering using
OPAL for microwatt platforms. Breaking the assumption that
powernv==EEH support is one of the few bits of work required to enable
that, but even if you don't go down that road I think everyone would
be better off if you kept a degree of separation between the two.


Re: [PATCH v2] powerpc/eeh: Fix EEH handling for hugepages in ioremap space.

2021-04-12 Thread Oliver O'Halloran
On Mon, Apr 12, 2021 at 5:52 PM Mahesh Salgaonkar  wrote:
>
> During the EEH MMIO error checking, the current implementation fails to map
> the (virtual) MMIO address back to the pci device on radix with hugepage
> mappings for I/O. This results into failure to dispatch EEH event with no
> recovery even when EEH capability has been enabled on the device.
>
> eeh_check_failure(token)# token = virtual MMIO address
>   addr = eeh_token_to_phys(token);
>   edev = eeh_addr_cache_get_dev(addr);
>   if (!edev)
> return 0;
>   eeh_dev_check_failure(edev);  <= Dispatch the EEH event
>
> In case of hugepage mappings, eeh_token_to_phys() has a bug in virt -> phys
> translation that results in wrong physical address, which is then passed to
> eeh_addr_cache_get_dev() to match it against cached pci I/O address ranges
> to get to a PCI device. Hence, it fails to find a match and the EEH event
> never gets dispatched leaving the device in failed state.
>
> The commit 33439620680be ("powerpc/eeh: Handle hugepages in ioremap space")
> introduced following logic to translate virt to phys for hugepage mappings:
>
> eeh_token_to_phys():
> +   pa = pte_pfn(*ptep);
> +
> +   /* On radix we can do hugepage mappings for io, so handle that */
> +   if (hugepage_shift) {
> +   pa <<= hugepage_shift;  <= This is wrong
> +   pa |= token & ((1ul << hugepage_shift) - 1);
> +   }

I think I vaguely remember thinking "is this right?" at the time.
Apparently not!

Reviewed-by: Oliver O'Halloran 


It would probably be a good idea to add a debugfs interface to help
with testing the address translation. Maybe something like:

echo  > /sys/kernel/debug/powerpc/eeh_addr_check

Then in the kernel:

struct resource *r = lookup_resource(mmio_addr);
void *virt = ioremap_resource(r);
ret = eeh_check_failure(virt);
iounmap(virt)

return ret;

A little tedious, but then you can write a selftest :)

Oliver


Re: [PASEMI] Nemo board doesn't boot anymore because of moving pas_pci_init

2021-02-23 Thread Oliver O'Halloran
On Wed, Feb 24, 2021 at 11:55 AM Michael Ellerman  wrote:
>
> Olof Johansson  writes:
> > Hi,
> >
> > On Tue, Feb 23, 2021 at 1:43 PM Christian Zigotzky
> >  wrote:
> >>
> >> Hello,
> >>
> >> The Nemo board [1] with a P.A. Semi PA6T SoC doesn't boot anymore
> >> because of moving "pas_pci_init" to the device tree adoption [2] in the
> >> latest PowerPC updates 5.12-1 [3].
> >>
> >> Unfortunately the Nemo board doesn't have it in its device tree. I
> >> reverted this commit and after that the Nemo board boots without any
> >> problems.
> >>
> >> What do you think about this ifdef?
> >>
> >> #ifdef CONFIG_PPC_PASEMI_NEMO
> >>  /*
> >>   * Check for the Nemo motherboard here, if we are running on one
> >>   * then pas_pci_init()
> >>   */
> >>  if (of_machine_is_compatible("pasemi,nemo")) {
> >>  pas_pci_init();
> >>  }
> >> #endif
> >
> > This is not a proper fix for the problem. Someone will need to debug
> > what on the pas_pci_init() codepath still needs to happen early in the
> > boot, even if the main PCI setup happens later.
>
> I looked but don't see anything 100% obvious.
>
> Possibly it's the call to isa_bridge_find_early()?

Looks like it. I think the problem stems from the use of the PIO
helpers (mainly outb()) in i8259_init() which is called from
nemo_init_IRQ(). The PIO helpers require the ISA space to be mapped
and io_isa_base to be set since they take a PIO register address
rather than an MMIO address. It looks like there's a few other legacy
embedded platforms that might have the same problem.

I guess the real fix would be to decouple the ISA base address
discovery from the PHB discovery. That should be doable since it's all
discovered via DT anyway and we only support one ISA address range,
but it's a bit of work.


Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.12-1 tag

2021-02-22 Thread Oliver O'Halloran
On Tue, Feb 23, 2021 at 9:44 AM Linus Torvalds
 wrote:
>
> On Mon, Feb 22, 2021 at 4:06 AM Michael Ellerman  wrote:
> >
> > Please pull powerpc updates for 5.12.
>
> Pulled. However:
>
> >  mode change 100755 => 100644 
> > tools/testing/selftests/powerpc/eeh/eeh-functions.sh
> >  create mode 100755 tools/testing/selftests/powerpc/eeh/eeh-vf-aware.sh
> >  create mode 100755 tools/testing/selftests/powerpc/eeh/eeh-vf-unaware.sh
>
> Somebody is being confused.
>
> Why create two new shell scripts with the proper executable bit, and
> then remove the executable bit from an existing one?
>
> That just seems very inconsistent.

eeh-function.sh just provides some helper functions for the other
scripts and doesn't do anything when executed directly. I thought
making it non-executable made sense.

>
>  Linus


Re: [PATCH] arch:powerpc simple_write_to_buffer return check

2021-02-04 Thread Oliver O'Halloran
On Fri, Feb 5, 2021 at 5:17 AM Mayank Suman  wrote:
>
> Signed-off-by: Mayank Suman 

commit messages aren't optional

> ---
>  arch/powerpc/kernel/eeh.c| 8 
>  arch/powerpc/platforms/powernv/eeh-powernv.c | 4 ++--
>  2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index 813713c9120c..2dbe1558a71f 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -1628,8 +1628,8 @@ static ssize_t eeh_force_recover_write(struct file 
> *filp,
> char buf[20];
> int ret;
>
> -   ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count);
> -   if (!ret)
> +   ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, 
> count);

We should probably be zeroing the buffer. Reading to sizeof(buf) - 1
is done in a few places to guarantee that the string is nul
terminated, but without the preceeding memset() that isn't actually
guaranteed.

> +   if (ret <= 0)
> return -EFAULT;

EFAULT is supposed to be returned when the user supplies a buffer to
write(2) which is outside their address space. I figured letting the
sscanf() in the next step fail if the user passes writes a zero-length
buffer and returning EINVAL made more sense. That said, the exact
semantics around zero length writes are pretty handwavy so I guess
this isn't wrong, but I don't think it's better either.

> /*
> @@ -1696,7 +1696,7 @@ static ssize_t eeh_dev_check_write(struct file *filp,
>
> memset(buf, 0, sizeof(buf));
> ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, 
> count);
> -   if (!ret)
> +   if (ret <= 0)
> return -EFAULT;
>
> ret = sscanf(buf, "%x:%x:%x.%x", , , , );
> @@ -1836,7 +1836,7 @@ static ssize_t eeh_dev_break_write(struct file *filp,
>
> memset(buf, 0, sizeof(buf));
> ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, 
> count);
> -   if (!ret)
> +   if (ret <= 0)
> return -EFAULT;
>
> ret = sscanf(buf, "%x:%x:%x.%x", , , , );
> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
> b/arch/powerpc/platforms/powernv/eeh-powernv.c
> index 89e22c460ebf..36ed2b8f7375 100644
> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> @@ -76,8 +76,8 @@ static ssize_t pnv_eeh_ei_write(struct file *filp,
> return -ENXIO;
>
> /* Copy over argument buffer */
> -   ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count);
> -   if (!ret)
> +   ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, 
> count);
> +   if (ret <= 0)
> return -EFAULT;
>
> /* Retrieve parameters */
> --
> 2.30.0
>


Re: [PATCH] powerpc/eeh: remove unneeded semicolon

2021-02-01 Thread Oliver O'Halloran
On Tue, Feb 2, 2021 at 2:21 PM Yang Li  wrote:
>
> Eliminate the following coccicheck warning:
> ./arch/powerpc/kernel/eeh.c:782:2-3: Unneeded semicolon

Doesn't appear to be a load-bearing semicolon. It's hard to tell with EEH.

Reviewed-by: Oliver O'Halloran 

>
> Reported-by: Abaci Robot 
> Signed-off-by: Yang Li 
> ---
>  arch/powerpc/kernel/eeh.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index 813713c..02c058d 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -779,7 +779,7 @@ int pcibios_set_pcie_reset_state(struct pci_dev *dev, 
> enum pcie_reset_state stat
> default:
> eeh_pe_state_clear(pe, EEH_PE_ISOLATED | EEH_PE_CFG_BLOCKED, 
> true);
> return -EINVAL;
> -   };
> +   }
>
> return 0;
>  }
> --
> 1.8.3.1
>


Re: [PATCH 3/3] selftests/powerpc: Add VF recovery tests

2020-11-24 Thread Oliver O'Halloran
On Tue, Nov 24, 2020 at 9:14 PM Frederic Barrat  wrote:
>
> Is it possible to run those tests on pseries? I haven't managed to set
> up a LPAR with a physical function which would let me enable a virtual
> function. All I could do is assign a virtual function to a LPAR. When
> assigning a physical function to the LPAR, enabling a virtual function
> fails because of missing properties in the device tree, so it looks like
> the hypervisor doesn't support it (?).
>
> Same story on qemu.
>
>Fred

IIRC having the guest manage SR-IOV was a half-baked feature that
never made it into a production phyp build. I never managed to get any
real documentation from the phyp folks about how it worked either. As
far as I can tell it's pretty similar to what we do on PowerNV with
the PE configuration being handled by h-call rather than OPAL call.
The main difference would be in how EEH freezes are handled and I know
there's *something* going on there, but I never really understood it
due to the lack of documentation.

I've been tempted to rip out all that crap a few times, but never
really got around to it. There was also some noises about implementing
support for guest managed SRIOV in pseries qemu, but I'm not sure what
ever happened to that.

Oliver


[PATCH 2/2] powerpc/eeh: Add a debugfs interface to check if a driver supports recovery

2020-11-02 Thread Oliver O'Halloran
If a PCI device's current driver implements the error handling callbacks
EEH can use them to recover the device after an error occurs. For devices
without the error handling callbacks we recover them by removing the device
and re-scanning it so the PCI core puts the device back into a known good
state.

Currently there's no way for userspace to determine if the driver supports
recovery or not which makes it difficult to write automated tests for EEH.
This patch addressing that by adding a debugfs interface for querying if
a specific device can be recovered or not.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/kernel/eeh.c | 50 +++
 1 file changed, 50 insertions(+)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index f9182ff57804..cd60bc1c8701 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1868,6 +1868,53 @@ static const struct file_operations eeh_dev_break_fops = 
{
.read   = eeh_debugfs_dev_usage,
 };
 
+static ssize_t eeh_dev_can_recover(struct file *filp,
+  const char __user *user_buf,
+  size_t count, loff_t *ppos)
+{
+   struct pci_driver *drv;
+   struct pci_dev *pdev;
+   size_t ret;
+
+   pdev = eeh_debug_lookup_pdev(filp, user_buf, count, ppos);
+   if (IS_ERR(pdev))
+   return PTR_ERR(pdev);
+
+   /*
+* In order for error recovery to work the driver needs to implement
+* .error_detected(), so it can quiesce IO to the device, and
+* .slot_reset() so it can re-initialise the device after a reset.
+*
+* Ideally they'd implement .resume() too, but some drivers which
+* we need to support (notably IPR) don't so I guess we can tolerate
+* that.
+*
+* .mmio_enabled() is mostly there as a work-around for devices which
+* take forever to re-init after a hot reset. Implementing that is
+* strictly optional.
+*/
+   drv = pci_dev_driver(pdev);
+   if (drv &&
+   drv->err_handler &&
+   drv->err_handler->error_detected &&
+   drv->err_handler->slot_reset) {
+   ret = count;
+   } else {
+   ret = -EOPNOTSUPP;
+   }
+
+   pci_dev_put(pdev);
+
+   return ret;
+}
+
+static const struct file_operations eeh_dev_can_recover_fops = {
+   .open   = simple_open,
+   .llseek = no_llseek,
+   .write  = eeh_dev_can_recover,
+   .read   = eeh_debugfs_dev_usage,
+};
+
 #endif
 
 static int __init eeh_init_proc(void)
@@ -1892,6 +1939,9 @@ static int __init eeh_init_proc(void)
debugfs_create_file_unsafe("eeh_force_recover", 0600,
powerpc_debugfs_root, NULL,
_force_recover_fops);
+   debugfs_create_file_unsafe("eeh_dev_can_recover", 0600,
+   powerpc_debugfs_root, NULL,
+   _dev_can_recover_fops);
eeh_cache_debugfs_init();
 #endif
}
-- 
2.26.2



[PATCH 1/2] powerpc/eeh: Rework pci_dev lookup in debugfs attributes

2020-11-02 Thread Oliver O'Halloran
Pull the string -> pci_dev lookup stuff into a helper function. No functional 
change.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/kernel/eeh.c | 71 ---
 1 file changed, 37 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 813713c9120c..f9182ff57804 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1596,6 +1596,35 @@ static int proc_eeh_show(struct seq_file *m, void *v)
 }
 
 #ifdef CONFIG_DEBUG_FS
+
+
+static struct pci_dev *eeh_debug_lookup_pdev(struct file *filp,
+const char __user *user_buf,
+size_t count, loff_t *ppos)
+{
+   uint32_t domain, bus, dev, fn;
+   struct pci_dev *pdev;
+   char buf[20];
+   int ret;
+
+   memset(buf, 0, sizeof(buf));
+   ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
+   if (!ret)
+   return ERR_PTR(-EFAULT);
+
+   ret = sscanf(buf, "%x:%x:%x.%x", , , , );
+   if (ret != 4) {
+   pr_err("%s: expected 4 args, got %d\n", __func__, ret);
+   return ERR_PTR(-EINVAL);
+   }
+
+   pdev = pci_get_domain_bus_and_slot(domain, bus, (dev << 3) | fn);
+   if (!pdev)
+   return ERR_PTR(-ENODEV);
+
+   return pdev;
+}
+
 static int eeh_enable_dbgfs_set(void *data, u64 val)
 {
if (val)
@@ -1688,26 +1717,13 @@ static ssize_t eeh_dev_check_write(struct file *filp,
const char __user *user_buf,
size_t count, loff_t *ppos)
 {
-   uint32_t domain, bus, dev, fn;
struct pci_dev *pdev;
struct eeh_dev *edev;
-   char buf[20];
int ret;
 
-   memset(buf, 0, sizeof(buf));
-   ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
-   if (!ret)
-   return -EFAULT;
-
-   ret = sscanf(buf, "%x:%x:%x.%x", , , , );
-   if (ret != 4) {
-   pr_err("%s: expected 4 args, got %d\n", __func__, ret);
-   return -EINVAL;
-   }
-
-   pdev = pci_get_domain_bus_and_slot(domain, bus, (dev << 3) | fn);
-   if (!pdev)
-   return -ENODEV;
+   pdev = eeh_debug_lookup_pdev(filp, user_buf, count, ppos);
+   if (IS_ERR(pdev))
+   return PTR_ERR(pdev);
 
edev = pci_dev_to_eeh_dev(pdev);
if (!edev) {
@@ -1717,8 +1733,8 @@ static ssize_t eeh_dev_check_write(struct file *filp,
}
 
ret = eeh_dev_check_failure(edev);
-   pci_info(pdev, "eeh_dev_check_failure(%04x:%02x:%02x.%01x) = %d\n",
-   domain, bus, dev, fn, ret);
+   pci_info(pdev, "eeh_dev_check_failure(%s) = %d\n",
+   pci_name(pdev), ret);
 
pci_dev_put(pdev);
 
@@ -1829,25 +1845,12 @@ static ssize_t eeh_dev_break_write(struct file *filp,
const char __user *user_buf,
size_t count, loff_t *ppos)
 {
-   uint32_t domain, bus, dev, fn;
struct pci_dev *pdev;
-   char buf[20];
int ret;
 
-   memset(buf, 0, sizeof(buf));
-   ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
-   if (!ret)
-   return -EFAULT;
-
-   ret = sscanf(buf, "%x:%x:%x.%x", , , , );
-   if (ret != 4) {
-   pr_err("%s: expected 4 args, got %d\n", __func__, ret);
-   return -EINVAL;
-   }
-
-   pdev = pci_get_domain_bus_and_slot(domain, bus, (dev << 3) | fn);
-   if (!pdev)
-   return -ENODEV;
+   pdev = eeh_debug_lookup_pdev(filp, user_buf, count, ppos);
+   if (IS_ERR(pdev))
+   return PTR_ERR(pdev);
 
ret = eeh_debugfs_break_device(pdev);
pci_dev_put(pdev);
-- 
2.26.2



[PATCH 3/3] selftests/powerpc: Add VF recovery tests

2020-11-02 Thread Oliver O'Halloran
The basic EEH test ignores VFs since we the way the eeh_dev_break debugfs
interface works means that if multiple VFs are enabled we may cause errors
on all them them. However, we can work around that by only enabling a
single VF at a time.

This patch adds some infrastructure for finding SR-IOV capable devices and
enabling / disabling VFs so we can exercise the VF specific EEH recovery
paths. Two new tests are added, one for testing EEH aware devices and one
for EEH un-aware VFs.

Signed-off-by: Oliver O'Halloran 
---
 .../selftests/powerpc/eeh/eeh-functions.sh| 108 ++
 .../selftests/powerpc/eeh/eeh-vf-aware.sh |  45 
 .../selftests/powerpc/eeh/eeh-vf-unaware.sh   |  35 ++
 3 files changed, 188 insertions(+)
 create mode 100755 tools/testing/selftests/powerpc/eeh/eeh-vf-aware.sh
 create mode 100755 tools/testing/selftests/powerpc/eeh/eeh-vf-unaware.sh

diff --git a/tools/testing/selftests/powerpc/eeh/eeh-functions.sh 
b/tools/testing/selftests/powerpc/eeh/eeh-functions.sh
index 32e5b7fbf18a..70daa3925dcb 100644
--- a/tools/testing/selftests/powerpc/eeh/eeh-functions.sh
+++ b/tools/testing/selftests/powerpc/eeh/eeh-functions.sh
@@ -135,3 +135,111 @@ eeh_one_dev() {
return 0;
 }
 
+eeh_has_driver() {
+   test -e /sys/bus/pci/devices/$1/driver;
+   return $?
+}
+
+eeh_can_recover() {
+   # we'll get an IO error if the device's current driver doesn't support
+   # error recovery
+   echo $1 > '/sys/kernel/debug/powerpc/eeh_dev_can_recover' 2>/dev/null
+
+   return $?
+}
+
+eeh_find_all_pfs() {
+   devices=""
+
+   # SR-IOV on pseries requires hypervisor support, so check for that
+   is_pseries=""
+   if grep -q pSeries /proc/cpuinfo ; then
+   if [ ! -f /proc/device-tree/rtas/ibm,open-sriov-allow-unfreeze 
] ||
+  [ ! -f /proc/device-tree/rtas/ibm,open-sriov-map-pe-number ] 
; then
+   return 1;
+   fi
+
+   is_pseries="true"
+   fi
+
+   for dev in `ls -1 /sys/bus/pci/devices/` ; do
+   sysfs="/sys/bus/pci/devices/$dev"
+   if [ ! -e "$sysfs/sriov_numvfs" ] ; then
+   continue
+   fi
+
+   # skip unsupported PFs on pseries
+   if [ -z "$is_pseries" ] &&
+  [ ! -f "$sysfs/of_node/ibm,is-open-sriov-pf" ] &&
+  [ ! -f "$sysfs/of_node/ibm,open-sriov-vf-bar-info" ] ; then
+   continue;
+   fi
+
+   # no driver, no vfs
+   if ! eeh_has_driver $dev ; then
+   continue
+   fi
+
+   devices="$devices $dev"
+   done
+
+   if [ -z "$devices" ] ; then
+   return 1;
+   fi
+
+   echo $devices
+   return 0;
+}
+
+# attempts to enable one VF on each PF so we can do VF specific tests.
+# stdout: list of enabled VFs, one per line
+# return code: 0 if vfs are found, 1 otherwise
+eeh_enable_vfs() {
+   pf_list="$(eeh_find_all_pfs)"
+
+   vfs=0
+   for dev in $pf_list ; do
+   pf_sysfs="/sys/bus/pci/devices/$dev"
+
+   # make sure we have a single VF
+   echo 0 > "$pf_sysfs/sriov_numvfs"
+   echo 1 > "$pf_sysfs/sriov_numvfs"
+   if [ "$?" != 0 ] ; then
+   log "Unable to enable VFs on $pf, skipping"
+   continue;
+   fi
+
+   vf="$(basename $(realpath "$pf_sysfs/virtfn0"))"
+   if [ $? != 0 ] ; then
+   log "unable to find enabled vf on $pf"
+   echo 0 > "$pf_sysfs/sriov_numvfs"
+   continue;
+   fi
+
+   if ! eeh_can_break $vf ; then
+   log "skipping "
+
+   echo 0 > "$pf_sysfs/sriov_numvfs"
+   continue;
+   fi
+
+   vfs="$((vfs + 1))"
+   echo $vf
+   done
+
+   test "$vfs" != 0
+   return $?
+}
+
+eeh_disable_vfs() {
+   pf_list="$(eeh_find_all_pfs)"
+   if [ -z "$pf_list" ] ; then
+   return 1;
+   fi
+
+   for dev in $pf_list ; do
+   echo 0 > "/sys/bus/pci/devices/$dev/sriov_numvfs"
+   done
+
+   return 0;
+}
diff --git a/tools/testing/selftests/powerpc/eeh/eeh-vf-aware.sh 
b/tools/testing/selftests/powerpc/eeh/eeh-vf-aware.sh
new file mode 100755
index ..874c11953bb6
--- /dev/null
+++ b/tools/testing/selftests/powerpc/eeh/eeh-vf-aware.sh
@@ -0,0 +1,45 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-only
+

[PATCH 2/3] selftests/powerpc: Use stderr for debug messages in eeh-functions

2020-11-02 Thread Oliver O'Halloran
We want to use stdout to return lists of devices, etc so log debug / status
messages to stderr rather than stdout.

Signed-off-by: Oliver O'Halloran 
---
 .../selftests/powerpc/eeh/eeh-functions.sh| 20 +++
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/powerpc/eeh/eeh-functions.sh 
b/tools/testing/selftests/powerpc/eeh/eeh-functions.sh
index 9b1bcc1fd4ad..32e5b7fbf18a 100644
--- a/tools/testing/selftests/powerpc/eeh/eeh-functions.sh
+++ b/tools/testing/selftests/powerpc/eeh/eeh-functions.sh
@@ -3,6 +3,10 @@
 
 export KSELFTESTS_SKIP=4
 
+log() {
+   echo >/dev/stderr $*
+}
+
 pe_ok() {
local dev="$1"
local path="/sys/bus/pci/devices/$dev/eeh_pe_state"
@@ -49,7 +53,7 @@ eeh_test_prep() {
 
if [ ! -e "/sys/kernel/debug/powerpc/eeh_dev_check" ] && \
   [ ! -e "/sys/kernel/debug/powerpc/eeh_dev_break" ] ; then
-   echo "debugfs EEH testing files are missing. Is debugfs 
mounted?"
+   log "debugfs EEH testing files are missing. Is debugfs mounted?"
exit $KSELFTESTS_SKIP;
fi
 
@@ -61,7 +65,7 @@ eeh_test_prep() {
 eeh_can_break() {
# skip bridges since we can't recover them (yet...)
if [ -e "/sys/bus/pci/devices/$dev/pci_bus" ] ; then
-   echo "$dev, Skipped: bridge"
+   log "$dev, Skipped: bridge"
return 1;
fi
 
@@ -70,7 +74,7 @@ eeh_can_break() {
# it the system will generally go down. We should probably fix that
# at some point
if [ "ahci" = "$(basename $(realpath 
/sys/bus/pci/devices/$dev/driver))" ] ; then
-   echo "$dev, Skipped: ahci doesn't support recovery"
+   log "$dev, Skipped: ahci doesn't support recovery"
return 1;
fi
 
@@ -80,7 +84,7 @@ eeh_can_break() {
# result in the recovery failing and the device being marked as
# failed.
if ! pe_ok $dev ; then
-   echo "$dev, Skipped: Bad initial PE state"
+   log "$dev, Skipped: Bad initial PE state"
return 1;
fi
 
@@ -94,7 +98,7 @@ eeh_one_dev() {
# testing so check that the argument is a well-formed sysfs device
# name.
if ! test -e /sys/bus/pci/devices/$dev/ ; then
-   echo "Error: '$dev' must be a sysfs device name (:BB:DD.F)"
+   log "Error: '$dev' must be a sysfs device name (:BB:DD.F)"
return 1;
fi
 
@@ -118,16 +122,16 @@ eeh_one_dev() {
if pe_ok $dev ; then
break;
fi
-   echo "$dev, waited $i/${max_wait}"
+   log "$dev, waited $i/${max_wait}"
sleep 1
done
 
if ! pe_ok $dev ; then
-   echo "$dev, Failed to recover!"
+   log "$dev, Failed to recover!"
return 1;
fi
 
-   echo "$dev, Recovered after $i seconds"
+   log "$dev, Recovered after $i seconds"
return 0;
 }
 
-- 
2.26.2



[PATCH 1/3] selftests/powerpc: Hoist helper code out of eeh-basic

2020-11-02 Thread Oliver O'Halloran
Hoist some of the useful test environment checking and prep code into
eeh-functions.sh so they can be reused in other tests.

Signed-off-by: Oliver O'Halloran 
---
 .../selftests/powerpc/eeh/eeh-basic.sh| 39 ++-
 .../selftests/powerpc/eeh/eeh-functions.sh| 48 +++
 2 files changed, 51 insertions(+), 36 deletions(-)
 mode change 100755 => 100644 
tools/testing/selftests/powerpc/eeh/eeh-functions.sh

diff --git a/tools/testing/selftests/powerpc/eeh/eeh-basic.sh 
b/tools/testing/selftests/powerpc/eeh/eeh-basic.sh
index 0d783e1065c8..16d00555f13e 100755
--- a/tools/testing/selftests/powerpc/eeh/eeh-basic.sh
+++ b/tools/testing/selftests/powerpc/eeh/eeh-basic.sh
@@ -1,28 +1,13 @@
 #!/bin/sh
 # SPDX-License-Identifier: GPL-2.0-only
 
-KSELFTESTS_SKIP=4
-
 . ./eeh-functions.sh
 
-if ! eeh_supported ; then
-   echo "EEH not supported on this system, skipping"
-   exit $KSELFTESTS_SKIP;
-fi
-
-if [ ! -e "/sys/kernel/debug/powerpc/eeh_dev_check" ] && \
-   [ ! -e "/sys/kernel/debug/powerpc/eeh_dev_break" ] ; then
-   echo "debugfs EEH testing files are missing. Is debugfs mounted?"
-   exit $KSELFTESTS_SKIP;
-fi
+eeh_test_prep # NB: may exit
 
 pre_lspci=`mktemp`
 lspci > $pre_lspci
 
-# Bump the max freeze count to something absurd so we don't
-# trip over it while breaking things.
-echo 5000 > /sys/kernel/debug/powerpc/eeh_max_freezes
-
 # record the devices that we break in here. Assuming everything
 # goes to plan we should get them back once the recover process
 # is finished.
@@ -30,34 +15,16 @@ devices=""
 
 # Build up a list of candidate devices.
 for dev in `ls -1 /sys/bus/pci/devices/ | grep '\.0$'` ; do
-   # skip bridges since we can't recover them (yet...)
-   if [ -e "/sys/bus/pci/devices/$dev/pci_bus" ] ; then
-   echo "$dev, Skipped: bridge"
+   if ! eeh_can_break $dev ; then
continue;
fi
 
-   # Skip VFs for now since we don't have a reliable way
-   # to break them.
+   # Skip VFs for now since we don't have a reliable way to break them.
if [ -e "/sys/bus/pci/devices/$dev/physfn" ] ; then
echo "$dev, Skipped: virtfn"
continue;
fi
 
-   if [ "ahci" = "$(basename $(realpath 
/sys/bus/pci/devices/$dev/driver))" ] ; then
-   echo "$dev, Skipped: ahci doesn't support recovery"
-   continue
-   fi
-
-   # Don't inject errosr into an already-frozen PE. This happens with
-   # PEs that contain multiple PCI devices (e.g. multi-function cards)
-   # and injecting new errors during the recovery process will probably
-   # result in the recovery failing and the device being marked as
-   # failed.
-   if ! pe_ok $dev ; then
-   echo "$dev, Skipped: Bad initial PE state"
-   continue;
-   fi
-
echo "$dev, Added"
 
# Add to this list of device to check
diff --git a/tools/testing/selftests/powerpc/eeh/eeh-functions.sh 
b/tools/testing/selftests/powerpc/eeh/eeh-functions.sh
old mode 100755
new mode 100644
index 00dc32c0ed75..9b1bcc1fd4ad
--- a/tools/testing/selftests/powerpc/eeh/eeh-functions.sh
+++ b/tools/testing/selftests/powerpc/eeh/eeh-functions.sh
@@ -1,6 +1,8 @@
 #!/bin/sh
 # SPDX-License-Identifier: GPL-2.0-only
 
+export KSELFTESTS_SKIP=4
+
 pe_ok() {
local dev="$1"
local path="/sys/bus/pci/devices/$dev/eeh_pe_state"
@@ -39,6 +41,52 @@ eeh_supported() {
grep -q 'EEH Subsystem is enabled' /proc/powerpc/eeh
 }
 
+eeh_test_prep() {
+   if ! eeh_supported ; then
+   echo "EEH not supported on this system, skipping"
+   exit $KSELFTESTS_SKIP;
+   fi
+
+   if [ ! -e "/sys/kernel/debug/powerpc/eeh_dev_check" ] && \
+  [ ! -e "/sys/kernel/debug/powerpc/eeh_dev_break" ] ; then
+   echo "debugfs EEH testing files are missing. Is debugfs 
mounted?"
+   exit $KSELFTESTS_SKIP;
+   fi
+
+   # Bump the max freeze count to something absurd so we don't
+   # trip over it while breaking things.
+   echo 5000 > /sys/kernel/debug/powerpc/eeh_max_freezes
+}
+
+eeh_can_break() {
+   # skip bridges since we can't recover them (yet...)
+   if [ -e "/sys/bus/pci/devices/$dev/pci_bus" ] ; then
+   echo "$dev, Skipped: bridge"
+   return 1;
+   fi
+
+   # The ahci driver doesn't support error recovery. If the ahci device
+   # happens to be hosting the root filesystem, and then we go and break
+   # it the system will generally go down. We should probably fix that
+   # at some point
+   if [ "ahci" = "$(basename $(realpath 
/sys/bus/pci/devices/$d

Re: [PATCH v2] powerpc/pci: unmap legacy INTx interrupts when a PHB is removed

2020-11-02 Thread Oliver O'Halloran
On Tue, Nov 3, 2020 at 1:39 AM Cédric Le Goater  wrote:
>
> On 10/14/20 4:55 AM, Alexey Kardashevskiy wrote:
> >
> > How do you remove PHBs exactly? There is no such thing in the powernv 
> > platform, I thought someone added this and you are fixing it but no. PHBs 
> > on powernv are created at the boot time and there is no way to remove them, 
> > you can only try removing all the bridges.
>
> yes. I noticed that later when proposing the fix for the double
> free.
>
> > So what exactly are you doing?
>
> What you just said above, with the commands :
>
>   echo 1 >  /sys/devices/pci0031\:00/0031\:00\:00.0/remove
>   echo 1 >  /sys/devices/pci0031\:00/pci_bus/0031\:00/rescan

Right, so that'll remove the root port device (and Bus 01 beneath it),
but the PHB itself is still there. If it was removed the root bus
would also disappear.


[PATCH 18/18] powerpc/powermac: Move PHB discovery

2020-11-02 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
compile tested with pmac32_defconfig and g5_defconfig
---
 arch/powerpc/platforms/powermac/setup.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powermac/setup.c 
b/arch/powerpc/platforms/powermac/setup.c
index 2e2cc0c75d87..86aee3f2483f 100644
--- a/arch/powerpc/platforms/powermac/setup.c
+++ b/arch/powerpc/platforms/powermac/setup.c
@@ -298,9 +298,6 @@ static void __init pmac_setup_arch(void)
of_node_put(ic);
}
 
-   /* Lookup PCI hosts */
-   pmac_pci_init();
-
 #ifdef CONFIG_PPC32
ohare_init();
l2cr_init();
@@ -600,6 +597,7 @@ define_machine(powermac) {
.name   = "PowerMac",
.probe  = pmac_probe,
.setup_arch = pmac_setup_arch,
+   .discover_phbs  = pmac_pci_init,
.show_cpuinfo   = pmac_show_cpuinfo,
.init_IRQ   = pmac_pic_init,
.get_irq= NULL, /* changed later */
-- 
2.26.2



[PATCH 17/18] powerpc/pasemi: Move PHB discovery

2020-11-02 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
compile tested with pasemi_defconfig
---
 arch/powerpc/platforms/pasemi/setup.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pasemi/setup.c 
b/arch/powerpc/platforms/pasemi/setup.c
index b612474f8f8e..376797eb7894 100644
--- a/arch/powerpc/platforms/pasemi/setup.c
+++ b/arch/powerpc/platforms/pasemi/setup.c
@@ -144,8 +144,6 @@ static void __init pas_setup_arch(void)
/* Setup SMP callback */
smp_ops = _smp_ops;
 #endif
-   /* Lookup PCI hosts */
-   pas_pci_init();
 
/* Remap SDC register for doing reset */
/* XXXOJN This should maybe come out of the device tree */
@@ -446,6 +444,7 @@ define_machine(pasemi) {
.name   = "PA Semi PWRficient",
.probe  = pas_probe,
.setup_arch = pas_setup_arch,
+   .discover_phbs  = pas_pci_init,
.init_IRQ   = pas_init_IRQ,
.get_irq= mpic_get_irq,
.restart= pas_restart,
-- 
2.26.2



[PATCH 16/18] powerpc/embedded6xx/mve5100: Move PHB discovery

2020-11-02 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
compile tested with mvme5100_defconfig
---
 arch/powerpc/platforms/embedded6xx/mvme5100.c   | 13 -
 arch/powerpc/platforms/embedded6xx/storcenter.c |  8 ++--
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/embedded6xx/mvme5100.c 
b/arch/powerpc/platforms/embedded6xx/mvme5100.c
index 1cd488daa0bf..c06a0490d157 100644
--- a/arch/powerpc/platforms/embedded6xx/mvme5100.c
+++ b/arch/powerpc/platforms/embedded6xx/mvme5100.c
@@ -154,17 +154,19 @@ static const struct of_device_id mvme5100_of_bus_ids[] 
__initconst = {
  */
 static void __init mvme5100_setup_arch(void)
 {
-   struct device_node *np;
-
if (ppc_md.progress)
ppc_md.progress("mvme5100_setup_arch()", 0);
 
-   for_each_compatible_node(np, "pci", "hawk-pci")
-   mvme5100_add_bridge(np);
-
restart = ioremap(BOARD_MODRST_REG, 4);
 }
 
+static void __init mvme5100_setup_pci(void)
+{
+   struct device_node *np;
+
+   for_each_compatible_node(np, "pci", "hawk-pci")
+   mvme5100_add_bridge(np);
+}
 
 static void mvme5100_show_cpuinfo(struct seq_file *m)
 {
@@ -205,6 +207,7 @@ define_machine(mvme5100) {
.name   = "MVME5100",
.probe  = mvme5100_probe,
.setup_arch = mvme5100_setup_arch,
+   .discover_phbs  = mvme5100_setup_pci,
.init_IRQ   = mvme5100_pic_init,
.show_cpuinfo   = mvme5100_show_cpuinfo,
.get_irq= mpic_get_irq,
diff --git a/arch/powerpc/platforms/embedded6xx/storcenter.c 
b/arch/powerpc/platforms/embedded6xx/storcenter.c
index e346ddcef45e..e188b90f7016 100644
--- a/arch/powerpc/platforms/embedded6xx/storcenter.c
+++ b/arch/powerpc/platforms/embedded6xx/storcenter.c
@@ -65,14 +65,17 @@ static int __init storcenter_add_bridge(struct device_node 
*dev)
 }
 
 static void __init storcenter_setup_arch(void)
+{
+   printk(KERN_INFO "IOMEGA StorCenter\n");
+}
+
+static void __init storcenter_setup_pci(void)
 {
struct device_node *np;
 
/* Lookup PCI host bridges */
for_each_compatible_node(np, "pci", "mpc10x-pci")
storcenter_add_bridge(np);
-
-   printk(KERN_INFO "IOMEGA StorCenter\n");
 }
 
 /*
@@ -117,6 +120,7 @@ define_machine(storcenter){
.name   = "IOMEGA StorCenter",
.probe  = storcenter_probe,
.setup_arch = storcenter_setup_arch,
+   .discover_phbs  = storcenter_setup_pci,
.init_IRQ   = storcenter_init_IRQ,
.get_irq= mpic_get_irq,
.restart= storcenter_restart,
-- 
2.26.2



[PATCH 15/18] powerpc/embedded6xx/mpc7448: Move PHB discovery

2020-11-02 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
compile tested with mpc7448_hpc2_defconfig
---
 arch/powerpc/platforms/embedded6xx/mpc7448_hpc2.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/embedded6xx/mpc7448_hpc2.c 
b/arch/powerpc/platforms/embedded6xx/mpc7448_hpc2.c
index b95c3380d2b5..5565647dc879 100644
--- a/arch/powerpc/platforms/embedded6xx/mpc7448_hpc2.c
+++ b/arch/powerpc/platforms/embedded6xx/mpc7448_hpc2.c
@@ -58,16 +58,14 @@ int mpc7448_hpc2_exclude_device(struct pci_controller *hose,
return PCIBIOS_SUCCESSFUL;
 }
 
-static void __init mpc7448_hpc2_setup_arch(void)
+static void __init mpc7448_hpc2_setup_pci(void)
 {
+#ifdef CONFIG_PCI
struct device_node *np;
if (ppc_md.progress)
-   ppc_md.progress("mpc7448_hpc2_setup_arch():set_bridge", 0);
-
-   tsi108_csr_vir_base = get_vir_csrbase();
+   ppc_md.progress("mpc7448_hpc2_setup_pci():set_bridge", 0);
 
/* setup PCI host bridge */
-#ifdef CONFIG_PCI
for_each_compatible_node(np, "pci", "tsi108-pci")
tsi108_setup_pci(np, MPC7448HPC2_PCI_CFG_PHYS, 0);
 
@@ -75,6 +73,11 @@ static void __init mpc7448_hpc2_setup_arch(void)
if (ppc_md.progress)
ppc_md.progress("tsi108: resources set", 0x100);
 #endif
+}
+
+static void __init mpc7448_hpc2_setup_arch(void)
+{
+   tsi108_csr_vir_base = get_vir_csrbase();
 
printk(KERN_INFO "MPC7448HPC2 (TAIGA) Platform\n");
printk(KERN_INFO
@@ -181,6 +184,7 @@ define_machine(mpc7448_hpc2){
.name   = "MPC7448 HPC2",
.probe  = mpc7448_hpc2_probe,
.setup_arch = mpc7448_hpc2_setup_arch,
+   .discover_phbs  = mpc7448_hpc2_setup_pci,
.init_IRQ   = mpc7448_hpc2_init_IRQ,
.show_cpuinfo   = mpc7448_hpc2_show_cpuinfo,
.get_irq= mpic_get_irq,
-- 
2.26.2



[PATCH 14/18] powerpc/embedded6xx/linkstation: Move PHB discovery

2020-11-02 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
compile tested with linkstation_defconfig
---
 arch/powerpc/platforms/embedded6xx/linkstation.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/embedded6xx/linkstation.c 
b/arch/powerpc/platforms/embedded6xx/linkstation.c
index f514d5d28cd4..eb8342e7f84e 100644
--- a/arch/powerpc/platforms/embedded6xx/linkstation.c
+++ b/arch/powerpc/platforms/embedded6xx/linkstation.c
@@ -63,15 +63,18 @@ static int __init linkstation_add_bridge(struct device_node 
*dev)
 }
 
 static void __init linkstation_setup_arch(void)
+{
+   printk(KERN_INFO "BUFFALO Network Attached Storage Series\n");
+   printk(KERN_INFO "(C) 2002-2005 BUFFALO INC.\n");
+}
+
+static void __init linkstation_setup_pci(void)
 {
struct device_node *np;
 
/* Lookup PCI host bridges */
for_each_compatible_node(np, "pci", "mpc10x-pci")
linkstation_add_bridge(np);
-
-   printk(KERN_INFO "BUFFALO Network Attached Storage Series\n");
-   printk(KERN_INFO "(C) 2002-2005 BUFFALO INC.\n");
 }
 
 /*
@@ -153,6 +156,7 @@ define_machine(linkstation){
.name   = "Buffalo Linkstation",
.probe  = linkstation_probe,
.setup_arch = linkstation_setup_arch,
+   .discover_phbs  = linkstation_setup_pci,
.init_IRQ   = linkstation_init_IRQ,
.show_cpuinfo   = linkstation_show_cpuinfo,
.get_irq= mpic_get_irq,
-- 
2.26.2



[PATCH 13/18] powerpc/embedded6xx/holly: Move PHB discovery

2020-11-02 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
compile tested with holly_defconfig
---
 arch/powerpc/platforms/embedded6xx/holly.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/embedded6xx/holly.c 
b/arch/powerpc/platforms/embedded6xx/holly.c
index d8f2e2c737bb..53065d564161 100644
--- a/arch/powerpc/platforms/embedded6xx/holly.c
+++ b/arch/powerpc/platforms/embedded6xx/holly.c
@@ -108,15 +108,13 @@ static void holly_remap_bridge(void)
tsi108_write_reg(TSI108_PCI_P2O_BAR2, 0x0);
 }
 
-static void __init holly_setup_arch(void)
+static void __init holly_init_pci(void)
 {
struct device_node *np;
 
if (ppc_md.progress)
ppc_md.progress("holly_setup_arch():set_bridge", 0);
 
-   tsi108_csr_vir_base = get_vir_csrbase();
-
/* setup PCI host bridge */
holly_remap_bridge();
 
@@ -127,6 +125,11 @@ static void __init holly_setup_arch(void)
ppc_md.pci_exclude_device = holly_exclude_device;
if (ppc_md.progress)
ppc_md.progress("tsi108: resources set", 0x100);
+}
+
+static void __init holly_setup_arch(void)
+{
+   tsi108_csr_vir_base = get_vir_csrbase();
 
printk(KERN_INFO "PPC750GX/CL Platform\n");
 }
@@ -259,6 +262,7 @@ define_machine(holly){
.name   = "PPC750 GX/CL TSI",
.probe  = holly_probe,
.setup_arch = holly_setup_arch,
+   .discover_phbs  = holly_init_pci,
.init_IRQ   = holly_init_IRQ,
.show_cpuinfo   = holly_show_cpuinfo,
.get_irq= mpic_get_irq,
-- 
2.26.2



[PATCH 12/18] powerpc/chrp: Move PHB discovery

2020-11-02 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
compile tested with chrp32_defconfig
---
 arch/powerpc/platforms/chrp/pci.c   |  8 
 arch/powerpc/platforms/chrp/setup.c | 12 +---
 2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/chrp/pci.c 
b/arch/powerpc/platforms/chrp/pci.c
index b2c2bf35b76c..8c421dc78b28 100644
--- a/arch/powerpc/platforms/chrp/pci.c
+++ b/arch/powerpc/platforms/chrp/pci.c
@@ -314,6 +314,14 @@ chrp_find_bridges(void)
}
}
of_node_put(root);
+
+   /*
+*  "Temporary" fixes for PCI devices.
+*  -- Geert
+*/
+   hydra_init();   /* Mac I/O */
+
+   pci_create_OF_bus_map();
 }
 
 /* SL82C105 IDE Control/Status Register */
diff --git a/arch/powerpc/platforms/chrp/setup.c 
b/arch/powerpc/platforms/chrp/setup.c
index c45435aa5e36..3cfc382841e5 100644
--- a/arch/powerpc/platforms/chrp/setup.c
+++ b/arch/powerpc/platforms/chrp/setup.c
@@ -334,22 +334,11 @@ static void __init chrp_setup_arch(void)
/* On pegasos, enable the L2 cache if not already done by OF */
pegasos_set_l2cr();
 
-   /* Lookup PCI host bridges */
-   chrp_find_bridges();
-
-   /*
-*  Temporary fixes for PCI devices.
-*  -- Geert
-*/
-   hydra_init();   /* Mac I/O */
-
/*
 *  Fix the Super I/O configuration
 */
sio_init();
 
-   pci_create_OF_bus_map();
-
/*
 * Print the banner, then scroll down so boot progress
 * can be printed.  -- Cort
@@ -582,6 +571,7 @@ define_machine(chrp) {
.name   = "CHRP",
.probe  = chrp_probe,
.setup_arch = chrp_setup_arch,
+   .discover_phbs  = chrp_find_bridges,
.init   = chrp_init2,
.show_cpuinfo   = chrp_show_cpuinfo,
.init_IRQ   = chrp_init_IRQ,
-- 
2.26.2



[PATCH 11/18] powerpc/amigaone: Move PHB discovery

2020-11-02 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
compile tested with amigaone_defconfig
---
 arch/powerpc/platforms/amigaone/setup.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/amigaone/setup.c 
b/arch/powerpc/platforms/amigaone/setup.c
index f5d0bf999759..b25ddf39dd43 100644
--- a/arch/powerpc/platforms/amigaone/setup.c
+++ b/arch/powerpc/platforms/amigaone/setup.c
@@ -65,6 +65,12 @@ static int __init amigaone_add_bridge(struct device_node 
*dev)
 }
 
 void __init amigaone_setup_arch(void)
+{
+   if (ppc_md.progress)
+   ppc_md.progress("Linux/PPC "UTS_RELEASE"\n", 0);
+}
+
+void __init amigaone_discover_phbs(void)
 {
struct device_node *np;
int phb = -ENODEV;
@@ -74,9 +80,6 @@ void __init amigaone_setup_arch(void)
phb = amigaone_add_bridge(np);
 
BUG_ON(phb != 0);
-
-   if (ppc_md.progress)
-   ppc_md.progress("Linux/PPC "UTS_RELEASE"\n", 0);
 }
 
 void __init amigaone_init_IRQ(void)
@@ -159,6 +162,7 @@ define_machine(amigaone) {
.name   = "AmigaOne",
.probe  = amigaone_probe,
.setup_arch = amigaone_setup_arch,
+   .discover_phbs  = amigaone_discover_phbs,
.show_cpuinfo   = amigaone_show_cpuinfo,
.init_IRQ   = amigaone_init_IRQ,
.restart= amigaone_restart,
-- 
2.26.2



[PATCH 10/18] powerpc/83xx: Move PHB discovery

2020-11-02 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
compile tested with mpc83xx_defconfig
---
 arch/powerpc/platforms/83xx/asp834x.c | 1 +
 arch/powerpc/platforms/83xx/km83xx.c  | 1 +
 arch/powerpc/platforms/83xx/misc.c| 2 --
 arch/powerpc/platforms/83xx/mpc830x_rdb.c | 1 +
 arch/powerpc/platforms/83xx/mpc831x_rdb.c | 1 +
 arch/powerpc/platforms/83xx/mpc832x_mds.c | 1 +
 arch/powerpc/platforms/83xx/mpc832x_rdb.c | 1 +
 arch/powerpc/platforms/83xx/mpc834x_itx.c | 1 +
 arch/powerpc/platforms/83xx/mpc834x_mds.c | 1 +
 arch/powerpc/platforms/83xx/mpc836x_mds.c | 1 +
 arch/powerpc/platforms/83xx/mpc836x_rdk.c | 1 +
 arch/powerpc/platforms/83xx/mpc837x_mds.c | 1 +
 arch/powerpc/platforms/83xx/mpc837x_rdb.c | 1 +
 13 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/83xx/asp834x.c 
b/arch/powerpc/platforms/83xx/asp834x.c
index 28474876f41b..68061c2a57c1 100644
--- a/arch/powerpc/platforms/83xx/asp834x.c
+++ b/arch/powerpc/platforms/83xx/asp834x.c
@@ -44,6 +44,7 @@ define_machine(asp834x) {
.name   = "ASP8347E",
.probe  = asp834x_probe,
.setup_arch = asp834x_setup_arch,
+   .discover_phbs  = mpc83xx_setup_pci,
.init_IRQ   = mpc83xx_ipic_init_IRQ,
.get_irq= ipic_get_irq,
.restart= mpc83xx_restart,
diff --git a/arch/powerpc/platforms/83xx/km83xx.c 
b/arch/powerpc/platforms/83xx/km83xx.c
index bcdc2c203ec9..108e1e4d2683 100644
--- a/arch/powerpc/platforms/83xx/km83xx.c
+++ b/arch/powerpc/platforms/83xx/km83xx.c
@@ -180,6 +180,7 @@ define_machine(mpc83xx_km) {
.name   = "mpc83xx-km-platform",
.probe  = mpc83xx_km_probe,
.setup_arch = mpc83xx_km_setup_arch,
+   .discover_phbs  = mpc83xx_setup_pci,
.init_IRQ   = mpc83xx_ipic_init_IRQ,
.get_irq= ipic_get_irq,
.restart= mpc83xx_restart,
diff --git a/arch/powerpc/platforms/83xx/misc.c 
b/arch/powerpc/platforms/83xx/misc.c
index a952e91db3ee..3285dabcf923 100644
--- a/arch/powerpc/platforms/83xx/misc.c
+++ b/arch/powerpc/platforms/83xx/misc.c
@@ -132,8 +132,6 @@ void __init mpc83xx_setup_arch(void)
setbat(-1, va, immrbase, immrsize, PAGE_KERNEL_NCG);
update_bats();
}
-
-   mpc83xx_setup_pci();
 }
 
 int machine_check_83xx(struct pt_regs *regs)
diff --git a/arch/powerpc/platforms/83xx/mpc830x_rdb.c 
b/arch/powerpc/platforms/83xx/mpc830x_rdb.c
index 51426e88ec67..956d4389effa 100644
--- a/arch/powerpc/platforms/83xx/mpc830x_rdb.c
+++ b/arch/powerpc/platforms/83xx/mpc830x_rdb.c
@@ -48,6 +48,7 @@ define_machine(mpc830x_rdb) {
.name   = "MPC830x RDB",
.probe  = mpc830x_rdb_probe,
.setup_arch = mpc830x_rdb_setup_arch,
+   .discover_phbs  = mpc83xx_setup_pci,
.init_IRQ   = mpc83xx_ipic_init_IRQ,
.get_irq= ipic_get_irq,
.restart= mpc83xx_restart,
diff --git a/arch/powerpc/platforms/83xx/mpc831x_rdb.c 
b/arch/powerpc/platforms/83xx/mpc831x_rdb.c
index 5ccd57a48492..3b578f080e3b 100644
--- a/arch/powerpc/platforms/83xx/mpc831x_rdb.c
+++ b/arch/powerpc/platforms/83xx/mpc831x_rdb.c
@@ -48,6 +48,7 @@ define_machine(mpc831x_rdb) {
.name   = "MPC831x RDB",
.probe  = mpc831x_rdb_probe,
.setup_arch = mpc831x_rdb_setup_arch,
+   .discover_phbs  = mpc83xx_setup_pci,
.init_IRQ   = mpc83xx_ipic_init_IRQ,
.get_irq= ipic_get_irq,
.restart= mpc83xx_restart,
diff --git a/arch/powerpc/platforms/83xx/mpc832x_mds.c 
b/arch/powerpc/platforms/83xx/mpc832x_mds.c
index 6fa5402ebf20..850d566ef900 100644
--- a/arch/powerpc/platforms/83xx/mpc832x_mds.c
+++ b/arch/powerpc/platforms/83xx/mpc832x_mds.c
@@ -101,6 +101,7 @@ define_machine(mpc832x_mds) {
.name   = "MPC832x MDS",
.probe  = mpc832x_sys_probe,
.setup_arch = mpc832x_sys_setup_arch,
+   .discover_phbs  = mpc83xx_setup_pci,
.init_IRQ   = mpc83xx_ipic_init_IRQ,
.get_irq= ipic_get_irq,
.restart= mpc83xx_restart,
diff --git a/arch/powerpc/platforms/83xx/mpc832x_rdb.c 
b/arch/powerpc/platforms/83xx/mpc832x_rdb.c
index 622c625d5ce4..b6133a237a70 100644
--- a/arch/powerpc/platforms/83xx/mpc832x_rdb.c
+++ b/arch/powerpc/platforms/83xx/mpc832x_rdb.c
@@ -219,6 +219,7 @@ define_machine(mpc832x_rdb) {
.name   = "MPC832x RDB",
.probe  = mpc832x_rdb_probe,
.setup_arch = mpc832x_rdb_setup_arch,
+   .discover_phbs  = mpc83xx_setup_pci,
.init_IRQ   = mpc83xx_ipic_init_IRQ,
.get_irq= ipic_get_irq,
.restart= mpc83xx_restar

[PATCH 09/18] powerpc/82xx/*: Move PHB discovery

2020-11-02 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
compile tested with pq2fads_defconfig
---
 arch/powerpc/platforms/82xx/mpc8272_ads.c | 2 +-
 arch/powerpc/platforms/82xx/pq2fads.c | 3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/82xx/mpc8272_ads.c 
b/arch/powerpc/platforms/82xx/mpc8272_ads.c
index 3fe1a6593280..0b5b9dec16d5 100644
--- a/arch/powerpc/platforms/82xx/mpc8272_ads.c
+++ b/arch/powerpc/platforms/82xx/mpc8272_ads.c
@@ -171,7 +171,6 @@ static void __init mpc8272_ads_setup_arch(void)
iounmap(bcsr);
 
init_ioports();
-   pq2_init_pci();
 
if (ppc_md.progress)
ppc_md.progress("mpc8272_ads_setup_arch(), finish", 0);
@@ -205,6 +204,7 @@ define_machine(mpc8272_ads)
.name = "Freescale MPC8272 ADS",
.probe = mpc8272_ads_probe,
.setup_arch = mpc8272_ads_setup_arch,
+   .discover_phbs = pq2_init_pci,
.init_IRQ = mpc8272_ads_pic_init,
.get_irq = cpm2_get_irq,
.calibrate_decr = generic_calibrate_decr,
diff --git a/arch/powerpc/platforms/82xx/pq2fads.c 
b/arch/powerpc/platforms/82xx/pq2fads.c
index a74082140718..ac9113d524af 100644
--- a/arch/powerpc/platforms/82xx/pq2fads.c
+++ b/arch/powerpc/platforms/82xx/pq2fads.c
@@ -150,8 +150,6 @@ static void __init pq2fads_setup_arch(void)
/* Enable external IRQs */
clrbits32(_immr->im_siu_conf.siu_82xx.sc_siumcr, 0x0c00);
 
-   pq2_init_pci();
-
if (ppc_md.progress)
ppc_md.progress("pq2fads_setup_arch(), finish", 0);
 }
@@ -184,6 +182,7 @@ define_machine(pq2fads)
.name = "Freescale PQ2FADS",
.probe = pq2fads_probe,
.setup_arch = pq2fads_setup_arch,
+   .discover_phbs = pq2_init_pci,
.init_IRQ = pq2fads_pic_init,
.get_irq = cpm2_get_irq,
.calibrate_decr = generic_calibrate_decr,
-- 
2.26.2



[PATCH 08/18] powerpc/52xx/mpc5200_simple: Move PHB discovery

2020-11-02 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/platforms/52xx/mpc5200_simple.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/52xx/mpc5200_simple.c 
b/arch/powerpc/platforms/52xx/mpc5200_simple.c
index 2d01e9b2e779..b9f5675b0a1d 100644
--- a/arch/powerpc/platforms/52xx/mpc5200_simple.c
+++ b/arch/powerpc/platforms/52xx/mpc5200_simple.c
@@ -40,8 +40,6 @@ static void __init mpc5200_simple_setup_arch(void)
 
/* Some mpc5200 & mpc5200b related configuration */
mpc5200_setup_xlb_arbiter();
-
-   mpc52xx_setup_pci();
 }
 
 /* list of the supported boards */
@@ -73,6 +71,7 @@ define_machine(mpc5200_simple_platform) {
.name   = "mpc5200-simple-platform",
.probe  = mpc5200_simple_probe,
.setup_arch = mpc5200_simple_setup_arch,
+   .discover_phbs  = mpc52xx_setup_pci,
.init   = mpc52xx_declare_of_platform_devices,
.init_IRQ   = mpc52xx_init_irq,
.get_irq= mpc52xx_get_irq,
-- 
2.26.2



[PATCH 07/18] powerpc/52xx/media5200: Move PHB discovery

2020-11-02 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/platforms/52xx/media5200.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/52xx/media5200.c 
b/arch/powerpc/platforms/52xx/media5200.c
index 07c5bc4ed0b5..efb8bdecbcc7 100644
--- a/arch/powerpc/platforms/52xx/media5200.c
+++ b/arch/powerpc/platforms/52xx/media5200.c
@@ -202,8 +202,6 @@ static void __init media5200_setup_arch(void)
/* Some mpc5200 & mpc5200b related configuration */
mpc5200_setup_xlb_arbiter();
 
-   mpc52xx_setup_pci();
-
np = of_find_matching_node(NULL, mpc5200_gpio_ids);
gpio = of_iomap(np, 0);
of_node_put(np);
@@ -244,6 +242,7 @@ define_machine(media5200_platform) {
.name   = "media5200-platform",
.probe  = media5200_probe,
.setup_arch = media5200_setup_arch,
+   .discover_phbs  = mpc52xx_setup_pci,
.init   = mpc52xx_declare_of_platform_devices,
.init_IRQ   = media5200_init_irq,
.get_irq= mpc52xx_get_irq,
-- 
2.26.2



[PATCH 06/18] powerpc/52xx/lite5200: Move PHB discovery

2020-11-02 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
compile tested with 52xx/lite5200b_defconfig
---
 arch/powerpc/platforms/52xx/lite5200.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/52xx/lite5200.c 
b/arch/powerpc/platforms/52xx/lite5200.c
index 3181aac08225..04cc97397095 100644
--- a/arch/powerpc/platforms/52xx/lite5200.c
+++ b/arch/powerpc/platforms/52xx/lite5200.c
@@ -165,8 +165,6 @@ static void __init lite5200_setup_arch(void)
mpc52xx_suspend.board_resume_finish = lite5200_resume_finish;
lite5200_pm_init();
 #endif
-
-   mpc52xx_setup_pci();
 }
 
 static const char * const board[] __initconst = {
@@ -187,6 +185,7 @@ define_machine(lite5200) {
.name   = "lite5200",
.probe  = lite5200_probe,
.setup_arch = lite5200_setup_arch,
+   .discover_phbs  = mpc52xx_setup_pci,
.init   = mpc52xx_declare_of_platform_devices,
.init_IRQ   = mpc52xx_init_irq,
.get_irq= mpc52xx_get_irq,
-- 
2.26.2



[PATCH 05/18] powerpc/52xx/efika: Move PHB discovery

2020-11-02 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
compile tested with mpc5200_defconfig
---
 arch/powerpc/platforms/52xx/efika.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/52xx/efika.c 
b/arch/powerpc/platforms/52xx/efika.c
index 4514a6f7458a..3b7d70d71692 100644
--- a/arch/powerpc/platforms/52xx/efika.c
+++ b/arch/powerpc/platforms/52xx/efika.c
@@ -185,8 +185,6 @@ static void __init efika_setup_arch(void)
/* Map important registers from the internal memory map */
mpc52xx_map_common_devices();
 
-   efika_pcisetup();
-
 #ifdef CONFIG_PM
mpc52xx_suspend.board_suspend_prepare = efika_suspend_prepare;
mpc52xx_pm_init();
@@ -218,6 +216,7 @@ define_machine(efika)
.name   = EFIKA_PLATFORM_NAME,
.probe  = efika_probe,
.setup_arch = efika_setup_arch,
+   .discover_phbs  = efika_pcisetup,
.init   = mpc52xx_declare_of_platform_devices,
.show_cpuinfo   = efika_show_cpuinfo,
.init_IRQ   = mpc52xx_init_irq,
-- 
2.26.2



[PATCH 04/18] powerpc/512x: Move PHB discovery

2020-11-02 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
only compile tested
---
 arch/powerpc/platforms/512x/mpc5121_ads.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/512x/mpc5121_ads.c 
b/arch/powerpc/platforms/512x/mpc5121_ads.c
index 6303fbfc4e4f..9d030c2e0004 100644
--- a/arch/powerpc/platforms/512x/mpc5121_ads.c
+++ b/arch/powerpc/platforms/512x/mpc5121_ads.c
@@ -24,21 +24,23 @@
 
 static void __init mpc5121_ads_setup_arch(void)
 {
-#ifdef CONFIG_PCI
-   struct device_node *np;
-#endif
printk(KERN_INFO "MPC5121 ADS board from Freescale Semiconductor\n");
/*
 * cpld regs are needed early
 */
mpc5121_ads_cpld_map();
 
+   mpc512x_setup_arch();
+}
+
+static void __init mpc5121_ads_setup_pci(void)
+{
 #ifdef CONFIG_PCI
+   struct device_node *np;
+
for_each_compatible_node(np, "pci", "fsl,mpc5121-pci")
mpc83xx_add_bridge(np);
 #endif
-
-   mpc512x_setup_arch();
 }
 
 static void __init mpc5121_ads_init_IRQ(void)
@@ -64,6 +66,7 @@ define_machine(mpc5121_ads) {
.name   = "MPC5121 ADS",
.probe  = mpc5121_ads_probe,
.setup_arch = mpc5121_ads_setup_arch,
+   .discover_phbs  = mpc5121_ads_setup_pci,
.init   = mpc512x_init,
.init_IRQ   = mpc5121_ads_init_IRQ,
.get_irq= ipic_get_irq,
-- 
2.26.2



[PATCH 03/18] powerpc/maple: Move PHB discovery

2020-11-02 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/platforms/maple/setup.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/maple/setup.c 
b/arch/powerpc/platforms/maple/setup.c
index f7e66a2005b4..4e9ad5bf3efb 100644
--- a/arch/powerpc/platforms/maple/setup.c
+++ b/arch/powerpc/platforms/maple/setup.c
@@ -179,9 +179,6 @@ static void __init maple_setup_arch(void)
 #ifdef CONFIG_SMP
smp_ops = _smp_ops;
 #endif
-   /* Lookup PCI hosts */
-   maple_pci_init();
-
maple_use_rtas_reboot_and_halt_if_present();
 
printk(KERN_DEBUG "Using native/NAP idle loop\n");
@@ -351,6 +348,7 @@ define_machine(maple) {
.name   = "Maple",
.probe  = maple_probe,
.setup_arch = maple_setup_arch,
+   .discover_phbs  = maple_pci_init,
.init_IRQ   = maple_init_IRQ,
.pci_irq_fixup  = maple_pci_irq_fixup,
.pci_get_legacy_ide_irq = maple_pci_get_legacy_ide_irq,
-- 
2.26.2



[PATCH 02/18] powerpc/{powernv,pseries}: Move PHB discovery

2020-11-02 Thread Oliver O'Halloran
Make powernv and pseries use ppc_mc.discover_phbs. These two platforms need
to be done together because they both depends on pci_dn's being created
from the DT. The pci_dn contains a pointer to the relevant pci_controller
so they need to be created after the pci_controller structures are
available, but before  and before PCI devices are scanned. Currently this
ordering is provided by initcalls and the sequence is:

1. PHBs are discovered (setup_arch) (early boot, pre-initcalls)
2. pci_dn are created from the unflattended DT (core initcall)
3. PHBs are scanned pcibios_init() (subsys initcall)

The new ppc_md.discover_phbs() function is also a core_initcall so we can't
guarantee ordering between the creations of pci_controllers and the
creation of pci_dn's which require a pci_controller. We could use the
postcore, or core_sync initcall levels, but it's cleaner to just move the
pci_dn setup into the per-PHB inits which occur inside of .discover_phb()
for these platforms. This brings the boot-time path in line with the PHB
hotplug path that is used for pseries DLPAR operations too.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/kernel/pci_dn.c  | 22 --
 arch/powerpc/platforms/powernv/pci-ioda.c |  3 +++
 arch/powerpc/platforms/powernv/setup.c|  4 +---
 arch/powerpc/platforms/pseries/setup.c|  7 +--
 4 files changed, 9 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index 54e240597fd9..61571ae23953 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -481,28 +481,6 @@ void pci_devs_phb_init_dynamic(struct pci_controller *phb)
pci_traverse_device_nodes(dn, add_pdn, phb);
 }
 
-/** 
- * pci_devs_phb_init - Initialize phbs and pci devs under them.
- * 
- * This routine walks over all phb's (pci-host bridges) on the
- * system, and sets up assorted pci-related structures 
- * (including pci info in the device node structs) for each
- * pci device found underneath.  This routine runs once,
- * early in the boot sequence.
- */
-static int __init pci_devs_phb_init(void)
-{
-   struct pci_controller *phb, *tmp;
-
-   /* This must be done first so the device nodes have valid pci info! */
-   list_for_each_entry_safe(phb, tmp, _list, list_node)
-   pci_devs_phb_init_dynamic(phb);
-
-   return 0;
-}
-
-core_initcall(pci_devs_phb_init);
-
 static void pci_dev_pdn_setup(struct pci_dev *pdev)
 {
struct pci_dn *pdn;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 2b4ceb5e6ce4..d6815f03fee3 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -3176,6 +3176,9 @@ static void __init pnv_pci_init_ioda_phb(struct 
device_node *np,
/* Remove M64 resource if we can't configure it successfully */
if (!phb->init_m64 || phb->init_m64(phb))
hose->mem_resources[1].flags = 0;
+
+   /* create pci_dn's for DT nodes under this PHB */
+   pci_devs_phb_init_dynamic(hose);
 }
 
 void __init pnv_pci_init_ioda2_phb(struct device_node *np)
diff --git a/arch/powerpc/platforms/powernv/setup.c 
b/arch/powerpc/platforms/powernv/setup.c
index 9acaa0f131b9..92f5fa827909 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -162,9 +162,6 @@ static void __init pnv_setup_arch(void)
/* Initialize SMP */
pnv_smp_init();
 
-   /* Setup PCI */
-   pnv_pci_init();
-
/* Setup RTC and NVRAM callbacks */
if (firmware_has_feature(FW_FEATURE_OPAL))
opal_nvram_init();
@@ -524,6 +521,7 @@ define_machine(powernv) {
.init_IRQ   = pnv_init_IRQ,
.show_cpuinfo   = pnv_show_cpuinfo,
.get_proc_freq  = pnv_get_proc_freq,
+   .discover_phbs  = pnv_pci_init,
.progress   = pnv_progress,
.machine_shutdown   = pnv_shutdown,
.power_save = NULL,
diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index 633c45ec406d..e88b30d4b6cd 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -463,7 +463,7 @@ void pseries_little_endian_exceptions(void)
 }
 #endif
 
-static void __init find_and_init_phbs(void)
+static void __init pSeries_discover_phbs(void)
 {
struct device_node *node;
struct pci_controller *phb;
@@ -481,6 +481,9 @@ static void __init find_and_init_phbs(void)
pci_process_bridge_OF_ranges(phb, node, 0);
isa_bridge_find_early(phb);
phb->controller_ops = pseries_pci_controller_ops;
+
+   /* create pci_dn's for DT nodes under this PHB */
+   pci_devs_phb_init_dynamic(phb);
}
 
of_node_put(root);
@@ -777,7 +780,6 @@ static void __init pSeries_set

[PATCH 01/18] powerpc/pci: Add ppc_md.discover_phbs()

2020-11-02 Thread Oliver O'Halloran
On many powerpc platforms the discovery and initalisation of
pci_controllers (PHBs) happens inside of setup_arch(). This is very early
in boot (pre-initcalls) and means that we're initialising the PHB long
before many basic kernel services (slab allocator, debugfs, a real ioremap)
are available.

On PowerNV this causes an additional problem since we map the PHB registers
with ioremap(). As of commit d538aadc2718 ("powerpc/ioremap: warn on early
use of ioremap()") a warning is printed because we're using the "incorrect"
API to setup and MMIO mapping in searly boot. The kernel does provide
early_ioremap(), but that is not intended to create long-lived MMIO
mappings and a seperate warning is printed by generic code if
early_ioremap() mappings are "leaked."

This is all fixable with dumb hacks like using early_ioremap() to setup
the initial mapping then replacing it with a real ioremap later on in
boot, but it does raise the question: Why the hell are we setting up the
PHB's this early in boot?

The old and wise claim it's due to "hysterical rasins." Aside from amused
grapes there doesn't appear to be any real reason to maintain the current
behaviour. Already most of the newer embedded platforms perform PHB
discovery in an arch_initcall and between the end of setup_arch() and the
start of initcalls none of the generic kernel code does anything PCI
related. On powerpc scanning PHBs occurs in a subsys_initcall so it should
be possible to move the PHB discovery to a core, postcore or arch initcall.

This patch adds the ppc_md.discover_phbs hook and a core_initcall stub that
calls it. The core_initcalls are the earliest to be called so this will
any possibly issues with dependency between initcalls. This isn't just an
academic issue either since on pseries and PowerNV EEH init occurs in an
arch_initcall and depends on the pci_controllers being available, similarly
the creation of pci_dns occurs at core_initcall_sync (i.e. between core and
postcore initcalls). These problems need to be addressed seperately.

Cc: Paul Mackerras 
Cc: Christophe Leroy 
Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/include/asm/machdep.h |  3 +++
 arch/powerpc/kernel/pci-common.c   | 10 ++
 2 files changed, 13 insertions(+)

diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index 475687f24f4a..d319160d790c 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -59,6 +59,9 @@ struct machdep_calls {
int (*pcibios_root_bridge_prepare)(struct pci_host_bridge
*bridge);
 
+   /* finds all the pci_controllers present at boot */
+   void(*discover_phbs)(void);
+
/* To setup PHBs when using automatic OF platform driver for PCI */
int (*pci_setup_phb)(struct pci_controller *host);
 
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index be108616a721..6265e7d1c697 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1625,3 +1625,13 @@ static void fixup_hide_host_resource_fsl(struct pci_dev 
*dev)
 }
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MOTOROLA, PCI_ANY_ID, 
fixup_hide_host_resource_fsl);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_FREESCALE, PCI_ANY_ID, 
fixup_hide_host_resource_fsl);
+
+
+int __init discover_phbs(void)
+{
+   if (ppc_md.discover_phbs)
+   ppc_md.discover_phbs();
+
+   return 0;
+}
+core_initcall(discover_phbs);
-- 
2.26.2



Re: [PATCH] powerpc/eeh_cache: Fix a possible debugfs deadlock

2020-10-28 Thread Oliver O'Halloran
On Thu, Oct 29, 2020 at 2:27 AM Qian Cai  wrote:
>
> Lockdep complains that a possible deadlock below in
> eeh_addr_cache_show() because it is acquiring a lock with IRQ enabled,
> but eeh_addr_cache_insert_dev() needs to acquire the same lock with IRQ
> disabled. Let's just make eeh_addr_cache_show() acquire the lock with
> IRQ disabled as well.
>
> CPU0CPU1
> 
>lock(_io_addr_cache_root.piar_lock);
> local_irq_disable();
> lock(>lock);
> lock(_io_addr_cache_root.piar_lock);
>
>  lock(>lock);
>
>   *** DEADLOCK ***
>
>   lock_acquire+0x140/0x5f0
>   _raw_spin_lock_irqsave+0x64/0xb0
>   eeh_addr_cache_insert_dev+0x48/0x390
>   eeh_probe_device+0xb8/0x1a0
>   pnv_pcibios_bus_add_device+0x3c/0x80
>   pcibios_bus_add_device+0x118/0x290
>   pci_bus_add_device+0x28/0xe0
>   pci_bus_add_devices+0x54/0xb0
>   pcibios_init+0xc4/0x124
>   do_one_initcall+0xac/0x528
>   kernel_init_freeable+0x35c/0x3fc
>   kernel_init+0x24/0x148
>   ret_from_kernel_thread+0x5c/0x80
>
>   lock_acquire+0x140/0x5f0
>   _raw_spin_lock+0x4c/0x70
>   eeh_addr_cache_show+0x38/0x110
>   seq_read+0x1a0/0x660
>   vfs_read+0xc8/0x1f0
>   ksys_read+0x74/0x130
>   system_call_exception+0xf8/0x1d0
>   system_call_common+0xe8/0x218
>
> Fixes: 5ca85ae6318d ("powerpc/eeh_cache: Add a way to dump the EEH address 
> cache")
> Signed-off-by: Qian Cai 

Good catch,

Reviewed-by: Oliver O'Halloran 


[PATCH] powerpc/eeh: Fix eeh_dev_check_failure() for PE#0

2020-10-21 Thread Oliver O'Halloran
In commit 269e583357df ("powerpc/eeh: Delete eeh_pe->config_addr") the
following simplification was made:

-   if (!pe->addr && !pe->config_addr) {
+   if (!pe->addr) {
eeh_stats.no_cfg_addr++;
return 0;
}

This introduced a bug which causes EEH checking to be skipped for devices
in PE#0.

Before the change above the check would always pass since atleast one of
the two PE addresses would be non-zero in all circumstances. On PowerNV
pe->config_addr was the be the BDFN of the first device added to the PE.
The zero BDFN is reserved for the PHB's root port, but this is fine since
for obscure platform reasons the root port is never assigned to PE#0.

Similarly, on pseries pe->addr has always been non-zero for the reasons
outlined in commit 42de19d5ef71 ("powerpc/pseries/eeh: Allow zero to be a
valid PE configuration address").

We can fix the problem by deleting the block entirely The original
purpose of this test was to avoid performing EEH checks on devices there
were not on an EEH capable bus. In modern Linux the edev->pe pointer will
be NULL for devices that are not on an EEH capable bus. The code block
immediately above this one already checks for the edev->pe == NULL case
so this test (new and old) is entirely redundant.

Ideally we'd delete eeh_stats.no_cfg_addr too since nothing increments it
any more. Unfortunately, that information is exposed via /proc/powerpc/eeh
which means it's technically ABI. We could make it hard-coded, but that's
a change for another patch.

Fixes: 269e583357df ("powerpc/eeh: Delete eeh_pe->config_addr")
Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/kernel/eeh.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 4da880759a8b..7dd03d47693f 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -470,11 +470,6 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
return 0;
}
 
-   if (!pe->addr) {
-   eeh_stats.no_cfg_addr++;
-   return 0;
-   }
-
/*
 * On PowerNV platform, we might already have fenced PHB
 * there and we need take care of that firstly.
-- 
2.26.2



Re: [PATCH -next] Revert "powerpc/pci: unmap legacy INTx interrupts when a PHB is removed"

2020-10-14 Thread Oliver O'Halloran
On Thu, Oct 15, 2020 at 5:28 AM Qian Cai  wrote:
>
> This reverts commit 3a3181e16fbde752007759f8759d25e0ff1fc425 which
> causes memory corruptions on POWER9 NV.

I was going to post this along with a fix for Cedric's original bug,
but I can do that separately so:

Acked-by: Oliver O'Halloran 


[PATCH] selftests/powerpc: Fix eeh-basic.sh exit codes

2020-10-13 Thread Oliver O'Halloran
The kselftests test running infrastructure expects tests to finish with an
exit code of 4 if the test decided it should be skipped. Currently
eeh-basic.sh exits with the number of devices that failed to recover, so if
four devices didn't recover we'll report a skip instead of a fail.

Fix this by checking if the return code is non-zero and report success
and failure by returning 0 or 1 respectively. For the cases where should
actually skip return 4.

Fixes: 85d86c8aa52e ("selftests/powerpc: Add basic EEH selftest")
Signed-off-by: Oliver O'Halloran 
---
 tools/testing/selftests/powerpc/eeh/eeh-basic.sh | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/powerpc/eeh/eeh-basic.sh 
b/tools/testing/selftests/powerpc/eeh/eeh-basic.sh
index 8a8d0f456946..0d783e1065c8 100755
--- a/tools/testing/selftests/powerpc/eeh/eeh-basic.sh
+++ b/tools/testing/selftests/powerpc/eeh/eeh-basic.sh
@@ -1,17 +1,19 @@
 #!/bin/sh
 # SPDX-License-Identifier: GPL-2.0-only
 
+KSELFTESTS_SKIP=4
+
 . ./eeh-functions.sh
 
 if ! eeh_supported ; then
echo "EEH not supported on this system, skipping"
-   exit 0;
+   exit $KSELFTESTS_SKIP;
 fi
 
 if [ ! -e "/sys/kernel/debug/powerpc/eeh_dev_check" ] && \
[ ! -e "/sys/kernel/debug/powerpc/eeh_dev_break" ] ; then
echo "debugfs EEH testing files are missing. Is debugfs mounted?"
-   exit 1;
+   exit $KSELFTESTS_SKIP;
 fi
 
 pre_lspci=`mktemp`
@@ -84,4 +86,5 @@ echo "$failed devices failed to recover ($dev_count tested)"
 lspci | diff -u $pre_lspci -
 rm -f $pre_lspci
 
-exit $failed
+test "$failed" == 0
+exit $?
-- 
2.26.2



Re: [PATCH] powerpc/pci: Fix PHB removal/rescan on PowerNV

2020-10-07 Thread Oliver O'Halloran
On Fri, Sep 25, 2020 at 7:23 PM Cédric Le Goater  wrote:
>
> To fix an issue with PHB hotplug on pSeries machine (HPT/XIVE), commit
> 3a3181e16fbd introduced a PPC specific pcibios_remove_bus() routine to
> clear all interrupt mappings when the bus is removed. This routine
> frees an array allocated in pcibios_scan_phb().
>
> This broke PHB hotplug on PowerNV because, when a PHB is removed and
> re-scanned through sysfs, the PCI layer un-assigns and re-assigns
> resources to the PHB but does not destroy and recreate the PCI
> controller structure. Since pcibios_remove_bus() does not clear the
> 'irq_map' array pointer, a second removal of the PHB will try to free
> the array a second time and corrupt memory.

"PHB hotplug" and "hot-plugging devices under a PHB" are different
things. What you're saying here doesn't make a whole lot of sense to
me unless you're conflating the two. The distinction is important
since on pseries we can use DLPAR to add and remove actual PHBs (i.e.
the pci_controller) at runtime, but there's no corresponding mechanism
on PowerNV.

> Free the 'irq_map' array in pcibios_free_controller() to fix
> corruption and clear interrupt mapping after it has been
> disposed. This to avoid filling up the array with successive
> remove/rescan of a bus.

Even with this patch I think we're still broken. With this patch
applied the init path is something like:

per-phb init:
allocate phb->irq_map array
per-bus init:
nothing
per-device init:
pcibios_bus_add_device()
   pci_read_irq_line()
pci_irq_map_register(pci_dev, virq)
   *record the device's interrupt in phb->irq_map*

And the teardown path:

per-device teardown:
nothing
per-bus teardown:
pcibios_remove_bus()
pci_irq_map_dispose()
*walk phb->irq_map and dispose of each mapped interrupt*
per-phb teardown:
free(phb->irq_map)

There's a lot of asymmetry here, which is a problem in itself, but the
real issue is that when removing *any* pci_bus under a PHB we dispose
of the LSI\ for *every* device under that PHB. Not good.

Ideally we should be fixing this by having the per-device teardown
handle disposing the mapping. Unfortunately, there's no pcibios hook
that's called when removing a pci_dev. However, we can register a bus
notifier which will be called when the pci_dev is removed from its bus
and we already do that for the per-device EEH teardown and to handle
IOMMU TCE invalidation when the device is removed.


[PATCH 2/2] powerpc/pseries/eeh: Fix use of uninitialised variable

2020-10-06 Thread Oliver O'Halloran
If the RTAS call to query the PE address for a device fails we jump the
err: label where an error message is printed along with the return code.
However, the printed return code is from the "ret" variable which isn't set
at that point since we assigned the result to "addr" instead. Fix this by
consistently using the "ret" variable for the result of the RTAS call
helpers an dropping the "addr" local variable"

Fixes: 98ba956f6a38 ("powerpc/pseries/eeh: Rework device EEH PE determination")
Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/platforms/pseries/eeh_pseries.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c 
b/arch/powerpc/platforms/pseries/eeh_pseries.c
index d8fd5f7b2143..cf024fa37bda 100644
--- a/arch/powerpc/platforms/pseries/eeh_pseries.c
+++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
@@ -363,7 +363,6 @@ void pseries_eeh_init_edev(struct pci_dn *pdn)
 {
struct eeh_pe pe, *parent;
struct eeh_dev *edev;
-   int addr;
u32 pcie_flags;
int ret;
 
@@ -422,8 +421,8 @@ void pseries_eeh_init_edev(struct pci_dn *pdn)
}
 
/* first up, find the pe_config_addr for the PE containing the device */
-   addr = pseries_eeh_get_pe_config_addr(pdn);
-   if (addr < 0) {
+   ret = pseries_eeh_get_pe_config_addr(pdn);
+   if (ret < 0) {
eeh_edev_dbg(edev, "Unable to find pe_config_addr\n");
goto err;
}
@@ -431,7 +430,7 @@ void pseries_eeh_init_edev(struct pci_dn *pdn)
/* Try enable EEH on the fake PE */
memset(, 0, sizeof(struct eeh_pe));
pe.phb = pdn->phb;
-   pe.addr = addr;
+   pe.addr = ret;
 
eeh_edev_dbg(edev, "Enabling EEH on device\n");
ret = eeh_ops->set_option(, EEH_OPT_ENABLE);
@@ -440,7 +439,7 @@ void pseries_eeh_init_edev(struct pci_dn *pdn)
goto err;
}
 
-   edev->pe_config_addr = addr;
+   edev->pe_config_addr = pe.addr;
 
eeh_add_flag(EEH_ENABLED);
 
-- 
2.26.2



[PATCH 1/2] powerpc/eeh: Delete eeh_pe->config_addr

2020-10-06 Thread Oliver O'Halloran
The eeh_pe->config_addr field was supposed to be removed in
commit 35d64734b643 ("powerpc/eeh: Clean up PE addressing") which made it
largely unused. Finish the job.

Signed-off-by: Oliver O'Halloran 
---
mpe: the Fixes SHA is from ppc/next, fold it into that if you want.
---
 arch/powerpc/include/asm/eeh.h | 1 -
 arch/powerpc/kernel/eeh.c  | 2 +-
 arch/powerpc/kernel/eeh_pe.c   | 4 ++--
 3 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index dd6a4ac6c713..b1a5bba2e0b9 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -73,7 +73,6 @@ struct pci_dn;
 struct eeh_pe {
int type;   /* PE type: PHB/Bus/Device  */
int state;  /* PE EEH dependent mode*/
-   int config_addr;/* Traditional PCI address  */
int addr;   /* PE configuration address */
struct pci_controller *phb; /* Associated PHB   */
struct pci_bus *bus;/* Top PCI bus for bus PE   */
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 87de8b798b2d..0e160dffcb86 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -466,7 +466,7 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
return 0;
}
 
-   if (!pe->addr && !pe->config_addr) {
+   if (!pe->addr) {
eeh_stats.no_cfg_addr++;
return 0;
}
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index 61b7d4019051..845e024321d4 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -354,8 +354,8 @@ int eeh_pe_tree_insert(struct eeh_dev *edev, struct eeh_pe 
*new_pe_parent)
pr_err("%s: out of memory!\n", __func__);
return -ENOMEM;
}
-   pe->addr= edev->pe_config_addr;
-   pe->config_addr = edev->bdfn;
+
+   pe->addr = edev->pe_config_addr;
 
/*
 * Put the new EEH PE into hierarchy tree. If the parent
-- 
2.26.2



Re: [PATCH v2] powernv/elog: Fix the race while processing OPAL error log event.

2020-10-05 Thread Oliver O'Halloran
On Mon, Oct 5, 2020 at 3:12 PM Mahesh Salgaonkar  wrote:
>
> Every error log reported by OPAL is exported to userspace through a sysfs
> interface and notified using kobject_uevent(). The userspace daemon
> (opal_errd) then reads the error log and acknowledges it error log is saved
> safely to disk. Once acknowledged the kernel removes the respective sysfs
> file entry causing respective resources getting released including kobject.
>
> However there are chances where user daemon may already be scanning elog
> entries while new sysfs elog entry is being created by kernel. User daemon
> may read this new entry and ack it even before kernel can notify userspace
> about it through kobject_uevent() call. If that happens then we have a
> potential race between elog_ack_store->kobject_put() and kobject_uevent
> which can lead to use-after-free issue of a kernfs object resulting into a
> kernel crash. This patch fixes this race by protecting a sysfs file
> creation/notification by holding an additional reference count on kobject
> until we safely send kobject_uevent().
>
> Reported-by: Oliver O'Halloran 
> Signed-off-by: Mahesh Salgaonkar 
> Signed-off-by: Aneesh Kumar K.V 
> ---
> Change in v2:
> - Instead of mutex and use extra reference count on kobject to avoid the
>   race.
> ---
>  arch/powerpc/platforms/powernv/opal-elog.c |   15 +++
>  1 file changed, 15 insertions(+)
>
> diff --git a/arch/powerpc/platforms/powernv/opal-elog.c 
> b/arch/powerpc/platforms/powernv/opal-elog.c
> index 62ef7ad995da..230f102e87c0 100644
> --- a/arch/powerpc/platforms/powernv/opal-elog.c
> +++ b/arch/powerpc/platforms/powernv/opal-elog.c
> @@ -222,13 +222,28 @@ static struct elog_obj *create_elog_obj(uint64_t id, 
> size_t size, uint64_t type)
> return NULL;
> }
>
> +   /*
> +* As soon as sysfs file for this elog is created/activated there is
> +* chance opal_errd daemon might read and acknowledge this elog before
> +* kobject_uevent() is called. If that happens then we have a 
> potential
> +* race between elog_ack_store->kobject_put() and kobject_uevent which
> +* leads to use-after-free issue of a kernfs object resulting into
> +* kernel crash. To avoid this race take an additional reference count
> +* on kobject until we safely send kobject_uevent().
> +*/
> +
> +   kobject_get(>kobj);  /* extra reference count */
> rc = sysfs_create_bin_file(>kobj, >raw_attr);
> if (rc) {
> +   kobject_put(>kobj);
> +   /* Drop the extra reference count  */
> kobject_put(>kobj);
>     return NULL;
> }
>
> kobject_uevent(>kobj, KOBJ_ADD);
> +   /* Drop the extra reference count  */
> +   kobject_put(>kobj);

Makes sense,

Reviewed-by: Oliver O'Halloran 

>
> return elog;

Does the returned value actually get used anywhere? We'd have a
similar use-after-free problem if it does. This should probably return
an error code rather than the object itself.


Re: [PATCH v2] powernv/elog: Fix the race while processing OPAL error log event.

2020-10-05 Thread Oliver O'Halloran
On Mon, Oct 5, 2020 at 11:07 PM Ananth N Mavinakayanahalli
 wrote:
>
> On 10/5/20 9:42 AM, Mahesh Salgaonkar wrote:
> > Every error log reported by OPAL is exported to userspace through a sysfs
> > interface and notified using kobject_uevent(). The userspace daemon
> > (opal_errd) then reads the error log and acknowledges it error log is saved
> > safely to disk. Once acknowledged the kernel removes the respective sysfs
> > file entry causing respective resources getting released including kobject.
> >
> > However there are chances where user daemon may already be scanning elog
> > entries while new sysfs elog entry is being created by kernel. User daemon
> > may read this new entry and ack it even before kernel can notify userspace
> > about it through kobject_uevent() call. If that happens then we have a
> > potential race between elog_ack_store->kobject_put() and kobject_uevent
> > which can lead to use-after-free issue of a kernfs object resulting into a
> > kernel crash. This patch fixes this race by protecting a sysfs file
> > creation/notification by holding an additional reference count on kobject
> > until we safely send kobject_uevent().
> >
> > Reported-by: Oliver O'Halloran 
> > Signed-off-by: Mahesh Salgaonkar 
> > Signed-off-by: Aneesh Kumar K.V 
>
> cc stable?

+1


Re: [PATCH] rpadlpar_io:Add MODULE_DESCRIPTION entries to kernel modules

2020-09-28 Thread Oliver O'Halloran
On Tue, Sep 29, 2020 at 6:50 AM Tyrel Datwyler  wrote:
>
> On 9/23/20 11:41 PM, Oliver O'Halloran wrote:
> > On Thu, Sep 24, 2020 at 3:15 PM Mamatha Inamdar
> >  wrote:
> >>
> >> This patch adds a brief MODULE_DESCRIPTION to rpadlpar_io kernel modules
> >> (descriptions taken from Kconfig file)
> >>
> >> Signed-off-by: Mamatha Inamdar 
> >> ---
> >>  drivers/pci/hotplug/rpadlpar_core.c |1 +
> >>  1 file changed, 1 insertion(+)
> >>
> >> diff --git a/drivers/pci/hotplug/rpadlpar_core.c 
> >> b/drivers/pci/hotplug/rpadlpar_core.c
> >> index f979b70..bac65ed 100644
> >> --- a/drivers/pci/hotplug/rpadlpar_core.c
> >> +++ b/drivers/pci/hotplug/rpadlpar_core.c
> >> @@ -478,3 +478,4 @@ static void __exit rpadlpar_io_exit(void)
> >>  module_init(rpadlpar_io_init);
> >>  module_exit(rpadlpar_io_exit);
> >>  MODULE_LICENSE("GPL");
> >> +MODULE_DESCRIPTION("RPA Dynamic Logical Partitioning driver for I/O 
> >> slots");
> >
> > RPA as a spec was superseded by PAPR in the early 2000s. Can we rename
> > this already?
>
> I seem to recall Michael and I discussed the naming briefly when I added the
> maintainer entries for the drivers and that the PAPR acronym is almost as
> meaningless to most as the original RPA. While, IBM no longer uses the term
> pseries for Power hardware marketing it is the defacto platform identifier in
> the Linux kernel tree for what we would call PAPR compliant. All in all I have
> no problem with renaming, but maybe we should consider pseries_dlpar or even
> simpler ibmdlpar.

I'm not too bothered by what we call it so long as it's consistent
with *something* else in the tree. Using pseries rather than ibm as a
prefix would probably be better since the legacy ibmphp driver is in
the same directory.


Re: [PATCH] rpadlpar_io:Add MODULE_DESCRIPTION entries to kernel modules

2020-09-27 Thread Oliver O'Halloran
On Sat, Sep 26, 2020 at 5:43 AM Bjorn Helgaas  wrote:
>
> On Thu, Sep 24, 2020 at 04:41:39PM +1000, Oliver O'Halloran wrote:
> > On Thu, Sep 24, 2020 at 3:15 PM Mamatha Inamdar
> >  wrote:
> > >
> > > This patch adds a brief MODULE_DESCRIPTION to rpadlpar_io kernel modules
> > > (descriptions taken from Kconfig file)
> > >
> > > Signed-off-by: Mamatha Inamdar 
> > > ---
> > >  drivers/pci/hotplug/rpadlpar_core.c |1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > diff --git a/drivers/pci/hotplug/rpadlpar_core.c 
> > > b/drivers/pci/hotplug/rpadlpar_core.c
> > > index f979b70..bac65ed 100644
> > > --- a/drivers/pci/hotplug/rpadlpar_core.c
> > > +++ b/drivers/pci/hotplug/rpadlpar_core.c
> > > @@ -478,3 +478,4 @@ static void __exit rpadlpar_io_exit(void)
> > >  module_init(rpadlpar_io_init);
> > >  module_exit(rpadlpar_io_exit);
> > >  MODULE_LICENSE("GPL");
> > > +MODULE_DESCRIPTION("RPA Dynamic Logical Partitioning driver for I/O 
> > > slots");
> >
> > RPA as a spec was superseded by PAPR in the early 2000s. Can we rename
> > this already?
> >
> > The only potential problem I can see is scripts doing: modprobe
> > rpadlpar_io or similar
> >
> > However, we should be able to fix that with a module alias.
>
> Is MODULE_DESCRIPTION() connected with how modprobe works?

I don't think so. The description is just there as an FYI.

> If this patch just improves documentation, without breaking users of
> modprobe, I'm fine with it, even if it would be nice to rename to PAPR
> or something in the future.

Right, the change in this patch is just a documentation fix and
shouldn't cause any problems.

I was suggesting renaming the module itself since the term "RPA" is
only used in this hotplug driver and some of the corresponding PHB add
/ remove handling in arch/powerpc/platforms/pseries/. We can make that
change in a follow up though.

> But, please use "git log --oneline drivers/pci/hotplug/rpadlpar*" and
> match the style, and also look through the rest of drivers/pci/ to see
> if we should do the same thing to any other modules.
>
> Bjorn


Re: [PATCH] rpadlpar_io:Add MODULE_DESCRIPTION entries to kernel modules

2020-09-24 Thread Oliver O'Halloran
On Thu, Sep 24, 2020 at 3:15 PM Mamatha Inamdar
 wrote:
>
> This patch adds a brief MODULE_DESCRIPTION to rpadlpar_io kernel modules
> (descriptions taken from Kconfig file)
>
> Signed-off-by: Mamatha Inamdar 
> ---
>  drivers/pci/hotplug/rpadlpar_core.c |1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/pci/hotplug/rpadlpar_core.c 
> b/drivers/pci/hotplug/rpadlpar_core.c
> index f979b70..bac65ed 100644
> --- a/drivers/pci/hotplug/rpadlpar_core.c
> +++ b/drivers/pci/hotplug/rpadlpar_core.c
> @@ -478,3 +478,4 @@ static void __exit rpadlpar_io_exit(void)
>  module_init(rpadlpar_io_init);
>  module_exit(rpadlpar_io_exit);
>  MODULE_LICENSE("GPL");
> +MODULE_DESCRIPTION("RPA Dynamic Logical Partitioning driver for I/O slots");

RPA as a spec was superseded by PAPR in the early 2000s. Can we rename
this already?

The only potential problem I can see is scripts doing: modprobe
rpadlpar_io or similar

However, we should be able to fix that with a module alias.

Oliver


[RFC PATCH 18/18] powerpc/powermac: Move PHB discovery

2020-09-24 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
compile tested with pmac32_defconfig and g5_defconfig
---
 arch/powerpc/platforms/powermac/setup.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powermac/setup.c 
b/arch/powerpc/platforms/powermac/setup.c
index f002b0fa69b8..0f8669139a21 100644
--- a/arch/powerpc/platforms/powermac/setup.c
+++ b/arch/powerpc/platforms/powermac/setup.c
@@ -298,9 +298,6 @@ static void __init pmac_setup_arch(void)
of_node_put(ic);
}
 
-   /* Lookup PCI hosts */
-   pmac_pci_init();
-
 #ifdef CONFIG_PPC32
ohare_init();
l2cr_init();
@@ -600,6 +597,7 @@ define_machine(powermac) {
.name   = "PowerMac",
.probe  = pmac_probe,
.setup_arch = pmac_setup_arch,
+   .discover_phbs  = pmac_pci_init,
.show_cpuinfo   = pmac_show_cpuinfo,
.init_IRQ   = pmac_pic_init,
.get_irq= NULL, /* changed later */
-- 
2.26.2



[RFC PATCH 17/18] powerpc/pasemi: Move PHB discovery

2020-09-24 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
compile tested with pasemi_defconfig
---
 arch/powerpc/platforms/pasemi/setup.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pasemi/setup.c 
b/arch/powerpc/platforms/pasemi/setup.c
index b612474f8f8e..376797eb7894 100644
--- a/arch/powerpc/platforms/pasemi/setup.c
+++ b/arch/powerpc/platforms/pasemi/setup.c
@@ -144,8 +144,6 @@ static void __init pas_setup_arch(void)
/* Setup SMP callback */
smp_ops = _smp_ops;
 #endif
-   /* Lookup PCI hosts */
-   pas_pci_init();
 
/* Remap SDC register for doing reset */
/* XXXOJN This should maybe come out of the device tree */
@@ -446,6 +444,7 @@ define_machine(pasemi) {
.name   = "PA Semi PWRficient",
.probe  = pas_probe,
.setup_arch = pas_setup_arch,
+   .discover_phbs  = pas_pci_init,
.init_IRQ   = pas_init_IRQ,
.get_irq= mpic_get_irq,
.restart= pas_restart,
-- 
2.26.2



[RFC PATCH 16/18] powerpc/embedded6xx/mve5100: Move PHB discovery

2020-09-24 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
compile tested with mvme5100_defconfig
---
 arch/powerpc/platforms/embedded6xx/mvme5100.c   | 13 -
 arch/powerpc/platforms/embedded6xx/storcenter.c |  8 ++--
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/embedded6xx/mvme5100.c 
b/arch/powerpc/platforms/embedded6xx/mvme5100.c
index 1cd488daa0bf..c06a0490d157 100644
--- a/arch/powerpc/platforms/embedded6xx/mvme5100.c
+++ b/arch/powerpc/platforms/embedded6xx/mvme5100.c
@@ -154,17 +154,19 @@ static const struct of_device_id mvme5100_of_bus_ids[] 
__initconst = {
  */
 static void __init mvme5100_setup_arch(void)
 {
-   struct device_node *np;
-
if (ppc_md.progress)
ppc_md.progress("mvme5100_setup_arch()", 0);
 
-   for_each_compatible_node(np, "pci", "hawk-pci")
-   mvme5100_add_bridge(np);
-
restart = ioremap(BOARD_MODRST_REG, 4);
 }
 
+static void __init mvme5100_setup_pci(void)
+{
+   struct device_node *np;
+
+   for_each_compatible_node(np, "pci", "hawk-pci")
+   mvme5100_add_bridge(np);
+}
 
 static void mvme5100_show_cpuinfo(struct seq_file *m)
 {
@@ -205,6 +207,7 @@ define_machine(mvme5100) {
.name   = "MVME5100",
.probe  = mvme5100_probe,
.setup_arch = mvme5100_setup_arch,
+   .discover_phbs  = mvme5100_setup_pci,
.init_IRQ   = mvme5100_pic_init,
.show_cpuinfo   = mvme5100_show_cpuinfo,
.get_irq= mpic_get_irq,
diff --git a/arch/powerpc/platforms/embedded6xx/storcenter.c 
b/arch/powerpc/platforms/embedded6xx/storcenter.c
index ed1914dd34bb..e8c5de54b0e1 100644
--- a/arch/powerpc/platforms/embedded6xx/storcenter.c
+++ b/arch/powerpc/platforms/embedded6xx/storcenter.c
@@ -65,14 +65,17 @@ static int __init storcenter_add_bridge(struct device_node 
*dev)
 }
 
 static void __init storcenter_setup_arch(void)
+{
+   printk(KERN_INFO "IOMEGA StorCenter\n");
+}
+
+static void __init storcenter_setup_pci(void)
 {
struct device_node *np;
 
/* Lookup PCI host bridges */
for_each_compatible_node(np, "pci", "mpc10x-pci")
storcenter_add_bridge(np);
-
-   printk(KERN_INFO "IOMEGA StorCenter\n");
 }
 
 /*
@@ -116,6 +119,7 @@ define_machine(storcenter){
.name   = "IOMEGA StorCenter",
.probe  = storcenter_probe,
.setup_arch = storcenter_setup_arch,
+   .discover_phbs  = storcenter_setup_pci,
.init_IRQ   = storcenter_init_IRQ,
.get_irq= mpic_get_irq,
.restart= storcenter_restart,
-- 
2.26.2



[RFC PATCH 15/18] powerpc/embedded6xx/mpc7448: Move PHB discovery

2020-09-24 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
compile tested with mpc7448_hpc2_defconfig
---
 arch/powerpc/platforms/embedded6xx/mpc7448_hpc2.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/embedded6xx/mpc7448_hpc2.c 
b/arch/powerpc/platforms/embedded6xx/mpc7448_hpc2.c
index 15437abe1f6d..20b727584e40 100644
--- a/arch/powerpc/platforms/embedded6xx/mpc7448_hpc2.c
+++ b/arch/powerpc/platforms/embedded6xx/mpc7448_hpc2.c
@@ -58,16 +58,14 @@ int mpc7448_hpc2_exclude_device(struct pci_controller *hose,
return PCIBIOS_SUCCESSFUL;
 }
 
-static void __init mpc7448_hpc2_setup_arch(void)
+static void __init mpc7448_hpc2_setup_pci(void)
 {
+#ifdef CONFIG_PCI
struct device_node *np;
if (ppc_md.progress)
-   ppc_md.progress("mpc7448_hpc2_setup_arch():set_bridge", 0);
-
-   tsi108_csr_vir_base = get_vir_csrbase();
+   ppc_md.progress("mpc7448_hpc2_setup_pci():set_bridge", 0);
 
/* setup PCI host bridge */
-#ifdef CONFIG_PCI
for_each_compatible_node(np, "pci", "tsi108-pci")
tsi108_setup_pci(np, MPC7448HPC2_PCI_CFG_PHYS, 0);
 
@@ -75,6 +73,11 @@ static void __init mpc7448_hpc2_setup_arch(void)
if (ppc_md.progress)
ppc_md.progress("tsi108: resources set", 0x100);
 #endif
+}
+
+static void __init mpc7448_hpc2_setup_arch(void)
+{
+   tsi108_csr_vir_base = get_vir_csrbase();
 
printk(KERN_INFO "MPC7448HPC2 (TAIGA) Platform\n");
printk(KERN_INFO
@@ -180,6 +183,7 @@ define_machine(mpc7448_hpc2){
.name   = "MPC7448 HPC2",
.probe  = mpc7448_hpc2_probe,
.setup_arch = mpc7448_hpc2_setup_arch,
+   .discover_phbs  = mpc7448_hpc2_setup_pci,
.init_IRQ   = mpc7448_hpc2_init_IRQ,
.show_cpuinfo   = mpc7448_hpc2_show_cpuinfo,
.get_irq= mpic_get_irq,
-- 
2.26.2



[RFC PATCH 13/18] powerpc/embedded6xx/holly: Move PHB discovery

2020-09-24 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
compile tested with holly_defconfig
---
 arch/powerpc/platforms/embedded6xx/holly.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/embedded6xx/holly.c 
b/arch/powerpc/platforms/embedded6xx/holly.c
index d8f2e2c737bb..53065d564161 100644
--- a/arch/powerpc/platforms/embedded6xx/holly.c
+++ b/arch/powerpc/platforms/embedded6xx/holly.c
@@ -108,15 +108,13 @@ static void holly_remap_bridge(void)
tsi108_write_reg(TSI108_PCI_P2O_BAR2, 0x0);
 }
 
-static void __init holly_setup_arch(void)
+static void __init holly_init_pci(void)
 {
struct device_node *np;
 
if (ppc_md.progress)
ppc_md.progress("holly_setup_arch():set_bridge", 0);
 
-   tsi108_csr_vir_base = get_vir_csrbase();
-
/* setup PCI host bridge */
holly_remap_bridge();
 
@@ -127,6 +125,11 @@ static void __init holly_setup_arch(void)
ppc_md.pci_exclude_device = holly_exclude_device;
if (ppc_md.progress)
ppc_md.progress("tsi108: resources set", 0x100);
+}
+
+static void __init holly_setup_arch(void)
+{
+   tsi108_csr_vir_base = get_vir_csrbase();
 
printk(KERN_INFO "PPC750GX/CL Platform\n");
 }
@@ -259,6 +262,7 @@ define_machine(holly){
.name   = "PPC750 GX/CL TSI",
.probe  = holly_probe,
.setup_arch = holly_setup_arch,
+   .discover_phbs  = holly_init_pci,
.init_IRQ   = holly_init_IRQ,
.show_cpuinfo   = holly_show_cpuinfo,
.get_irq= mpic_get_irq,
-- 
2.26.2



[RFC PATCH 14/18] powerpc/embedded6xx/linkstation: Move PHB discovery

2020-09-24 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
compile tested with linkstation_defconfig
---
 arch/powerpc/platforms/embedded6xx/linkstation.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/embedded6xx/linkstation.c 
b/arch/powerpc/platforms/embedded6xx/linkstation.c
index f514d5d28cd4..eb8342e7f84e 100644
--- a/arch/powerpc/platforms/embedded6xx/linkstation.c
+++ b/arch/powerpc/platforms/embedded6xx/linkstation.c
@@ -63,15 +63,18 @@ static int __init linkstation_add_bridge(struct device_node 
*dev)
 }
 
 static void __init linkstation_setup_arch(void)
+{
+   printk(KERN_INFO "BUFFALO Network Attached Storage Series\n");
+   printk(KERN_INFO "(C) 2002-2005 BUFFALO INC.\n");
+}
+
+static void __init linkstation_setup_pci(void)
 {
struct device_node *np;
 
/* Lookup PCI host bridges */
for_each_compatible_node(np, "pci", "mpc10x-pci")
linkstation_add_bridge(np);
-
-   printk(KERN_INFO "BUFFALO Network Attached Storage Series\n");
-   printk(KERN_INFO "(C) 2002-2005 BUFFALO INC.\n");
 }
 
 /*
@@ -153,6 +156,7 @@ define_machine(linkstation){
.name   = "Buffalo Linkstation",
.probe  = linkstation_probe,
.setup_arch = linkstation_setup_arch,
+   .discover_phbs  = linkstation_setup_pci,
.init_IRQ   = linkstation_init_IRQ,
.show_cpuinfo   = linkstation_show_cpuinfo,
.get_irq= mpic_get_irq,
-- 
2.26.2



[RFC PATCH 12/18] powerpc/chrp: Move PHB discovery

2020-09-24 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
compile tested with chrp32_defconfig
---
 arch/powerpc/platforms/chrp/pci.c   |  8 
 arch/powerpc/platforms/chrp/setup.c | 12 +---
 2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/chrp/pci.c 
b/arch/powerpc/platforms/chrp/pci.c
index b2c2bf35b76c..8c421dc78b28 100644
--- a/arch/powerpc/platforms/chrp/pci.c
+++ b/arch/powerpc/platforms/chrp/pci.c
@@ -314,6 +314,14 @@ chrp_find_bridges(void)
}
}
of_node_put(root);
+
+   /*
+*  "Temporary" fixes for PCI devices.
+*  -- Geert
+*/
+   hydra_init();   /* Mac I/O */
+
+   pci_create_OF_bus_map();
 }
 
 /* SL82C105 IDE Control/Status Register */
diff --git a/arch/powerpc/platforms/chrp/setup.c 
b/arch/powerpc/platforms/chrp/setup.c
index c45435aa5e36..3cfc382841e5 100644
--- a/arch/powerpc/platforms/chrp/setup.c
+++ b/arch/powerpc/platforms/chrp/setup.c
@@ -334,22 +334,11 @@ static void __init chrp_setup_arch(void)
/* On pegasos, enable the L2 cache if not already done by OF */
pegasos_set_l2cr();
 
-   /* Lookup PCI host bridges */
-   chrp_find_bridges();
-
-   /*
-*  Temporary fixes for PCI devices.
-*  -- Geert
-*/
-   hydra_init();   /* Mac I/O */
-
/*
 *  Fix the Super I/O configuration
 */
sio_init();
 
-   pci_create_OF_bus_map();
-
/*
 * Print the banner, then scroll down so boot progress
 * can be printed.  -- Cort
@@ -582,6 +571,7 @@ define_machine(chrp) {
.name   = "CHRP",
.probe  = chrp_probe,
.setup_arch = chrp_setup_arch,
+   .discover_phbs  = chrp_find_bridges,
.init   = chrp_init2,
.show_cpuinfo   = chrp_show_cpuinfo,
.init_IRQ   = chrp_init_IRQ,
-- 
2.26.2



[RFC PATCH 11/18] powerpc/amigaone: Move PHB discovery

2020-09-24 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
compile tested with amigaone_defconfig
---
 arch/powerpc/platforms/amigaone/setup.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/amigaone/setup.c 
b/arch/powerpc/platforms/amigaone/setup.c
index f5d0bf999759..b25ddf39dd43 100644
--- a/arch/powerpc/platforms/amigaone/setup.c
+++ b/arch/powerpc/platforms/amigaone/setup.c
@@ -65,6 +65,12 @@ static int __init amigaone_add_bridge(struct device_node 
*dev)
 }
 
 void __init amigaone_setup_arch(void)
+{
+   if (ppc_md.progress)
+   ppc_md.progress("Linux/PPC "UTS_RELEASE"\n", 0);
+}
+
+void __init amigaone_discover_phbs(void)
 {
struct device_node *np;
int phb = -ENODEV;
@@ -74,9 +80,6 @@ void __init amigaone_setup_arch(void)
phb = amigaone_add_bridge(np);
 
BUG_ON(phb != 0);
-
-   if (ppc_md.progress)
-   ppc_md.progress("Linux/PPC "UTS_RELEASE"\n", 0);
 }
 
 void __init amigaone_init_IRQ(void)
@@ -159,6 +162,7 @@ define_machine(amigaone) {
.name   = "AmigaOne",
.probe  = amigaone_probe,
.setup_arch = amigaone_setup_arch,
+   .discover_phbs  = amigaone_discover_phbs,
.show_cpuinfo   = amigaone_show_cpuinfo,
.init_IRQ   = amigaone_init_IRQ,
.restart= amigaone_restart,
-- 
2.26.2



[RFC PATCH 10/18] powerpc/83xx: Move PHB discovery

2020-09-24 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
compile tested with mpc83xx_defconfig
---
 arch/powerpc/platforms/83xx/asp834x.c | 1 +
 arch/powerpc/platforms/83xx/km83xx.c  | 1 +
 arch/powerpc/platforms/83xx/misc.c| 2 --
 arch/powerpc/platforms/83xx/mpc830x_rdb.c | 1 +
 arch/powerpc/platforms/83xx/mpc831x_rdb.c | 1 +
 arch/powerpc/platforms/83xx/mpc832x_mds.c | 1 +
 arch/powerpc/platforms/83xx/mpc832x_rdb.c | 1 +
 arch/powerpc/platforms/83xx/mpc834x_itx.c | 1 +
 arch/powerpc/platforms/83xx/mpc834x_mds.c | 1 +
 arch/powerpc/platforms/83xx/mpc836x_mds.c | 1 +
 arch/powerpc/platforms/83xx/mpc836x_rdk.c | 1 +
 arch/powerpc/platforms/83xx/mpc837x_mds.c | 1 +
 arch/powerpc/platforms/83xx/mpc837x_rdb.c | 1 +
 13 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/83xx/asp834x.c 
b/arch/powerpc/platforms/83xx/asp834x.c
index 28474876f41b..68061c2a57c1 100644
--- a/arch/powerpc/platforms/83xx/asp834x.c
+++ b/arch/powerpc/platforms/83xx/asp834x.c
@@ -44,6 +44,7 @@ define_machine(asp834x) {
.name   = "ASP8347E",
.probe  = asp834x_probe,
.setup_arch = asp834x_setup_arch,
+   .discover_phbs  = mpc83xx_setup_pci,
.init_IRQ   = mpc83xx_ipic_init_IRQ,
.get_irq= ipic_get_irq,
.restart= mpc83xx_restart,
diff --git a/arch/powerpc/platforms/83xx/km83xx.c 
b/arch/powerpc/platforms/83xx/km83xx.c
index bcdc2c203ec9..108e1e4d2683 100644
--- a/arch/powerpc/platforms/83xx/km83xx.c
+++ b/arch/powerpc/platforms/83xx/km83xx.c
@@ -180,6 +180,7 @@ define_machine(mpc83xx_km) {
.name   = "mpc83xx-km-platform",
.probe  = mpc83xx_km_probe,
.setup_arch = mpc83xx_km_setup_arch,
+   .discover_phbs  = mpc83xx_setup_pci,
.init_IRQ   = mpc83xx_ipic_init_IRQ,
.get_irq= ipic_get_irq,
.restart= mpc83xx_restart,
diff --git a/arch/powerpc/platforms/83xx/misc.c 
b/arch/powerpc/platforms/83xx/misc.c
index a952e91db3ee..3285dabcf923 100644
--- a/arch/powerpc/platforms/83xx/misc.c
+++ b/arch/powerpc/platforms/83xx/misc.c
@@ -132,8 +132,6 @@ void __init mpc83xx_setup_arch(void)
setbat(-1, va, immrbase, immrsize, PAGE_KERNEL_NCG);
update_bats();
}
-
-   mpc83xx_setup_pci();
 }
 
 int machine_check_83xx(struct pt_regs *regs)
diff --git a/arch/powerpc/platforms/83xx/mpc830x_rdb.c 
b/arch/powerpc/platforms/83xx/mpc830x_rdb.c
index 51426e88ec67..956d4389effa 100644
--- a/arch/powerpc/platforms/83xx/mpc830x_rdb.c
+++ b/arch/powerpc/platforms/83xx/mpc830x_rdb.c
@@ -48,6 +48,7 @@ define_machine(mpc830x_rdb) {
.name   = "MPC830x RDB",
.probe  = mpc830x_rdb_probe,
.setup_arch = mpc830x_rdb_setup_arch,
+   .discover_phbs  = mpc83xx_setup_pci,
.init_IRQ   = mpc83xx_ipic_init_IRQ,
.get_irq= ipic_get_irq,
.restart= mpc83xx_restart,
diff --git a/arch/powerpc/platforms/83xx/mpc831x_rdb.c 
b/arch/powerpc/platforms/83xx/mpc831x_rdb.c
index 5ccd57a48492..3b578f080e3b 100644
--- a/arch/powerpc/platforms/83xx/mpc831x_rdb.c
+++ b/arch/powerpc/platforms/83xx/mpc831x_rdb.c
@@ -48,6 +48,7 @@ define_machine(mpc831x_rdb) {
.name   = "MPC831x RDB",
.probe  = mpc831x_rdb_probe,
.setup_arch = mpc831x_rdb_setup_arch,
+   .discover_phbs  = mpc83xx_setup_pci,
.init_IRQ   = mpc83xx_ipic_init_IRQ,
.get_irq= ipic_get_irq,
.restart= mpc83xx_restart,
diff --git a/arch/powerpc/platforms/83xx/mpc832x_mds.c 
b/arch/powerpc/platforms/83xx/mpc832x_mds.c
index 6fa5402ebf20..850d566ef900 100644
--- a/arch/powerpc/platforms/83xx/mpc832x_mds.c
+++ b/arch/powerpc/platforms/83xx/mpc832x_mds.c
@@ -101,6 +101,7 @@ define_machine(mpc832x_mds) {
.name   = "MPC832x MDS",
.probe  = mpc832x_sys_probe,
.setup_arch = mpc832x_sys_setup_arch,
+   .discover_phbs  = mpc83xx_setup_pci,
.init_IRQ   = mpc83xx_ipic_init_IRQ,
.get_irq= ipic_get_irq,
.restart= mpc83xx_restart,
diff --git a/arch/powerpc/platforms/83xx/mpc832x_rdb.c 
b/arch/powerpc/platforms/83xx/mpc832x_rdb.c
index 622c625d5ce4..b6133a237a70 100644
--- a/arch/powerpc/platforms/83xx/mpc832x_rdb.c
+++ b/arch/powerpc/platforms/83xx/mpc832x_rdb.c
@@ -219,6 +219,7 @@ define_machine(mpc832x_rdb) {
.name   = "MPC832x RDB",
.probe  = mpc832x_rdb_probe,
.setup_arch = mpc832x_rdb_setup_arch,
+   .discover_phbs  = mpc83xx_setup_pci,
.init_IRQ   = mpc83xx_ipic_init_IRQ,
.get_irq= ipic_get_irq,
.restart= mpc83xx_restar

[RFC PATCH 09/18] powerpc/82xx/*: Move PHB discovery

2020-09-24 Thread Oliver O'Halloran
Signed-off-by: Oliver O'Halloran 
---
compile tested with pq2fads_defconfig
---
 arch/powerpc/platforms/82xx/mpc8272_ads.c | 2 +-
 arch/powerpc/platforms/82xx/pq2fads.c | 3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/82xx/mpc8272_ads.c 
b/arch/powerpc/platforms/82xx/mpc8272_ads.c
index 3fe1a6593280..0b5b9dec16d5 100644
--- a/arch/powerpc/platforms/82xx/mpc8272_ads.c
+++ b/arch/powerpc/platforms/82xx/mpc8272_ads.c
@@ -171,7 +171,6 @@ static void __init mpc8272_ads_setup_arch(void)
iounmap(bcsr);
 
init_ioports();
-   pq2_init_pci();
 
if (ppc_md.progress)
ppc_md.progress("mpc8272_ads_setup_arch(), finish", 0);
@@ -205,6 +204,7 @@ define_machine(mpc8272_ads)
.name = "Freescale MPC8272 ADS",
.probe = mpc8272_ads_probe,
.setup_arch = mpc8272_ads_setup_arch,
+   .discover_phbs = pq2_init_pci,
.init_IRQ = mpc8272_ads_pic_init,
.get_irq = cpm2_get_irq,
.calibrate_decr = generic_calibrate_decr,
diff --git a/arch/powerpc/platforms/82xx/pq2fads.c 
b/arch/powerpc/platforms/82xx/pq2fads.c
index a74082140718..ac9113d524af 100644
--- a/arch/powerpc/platforms/82xx/pq2fads.c
+++ b/arch/powerpc/platforms/82xx/pq2fads.c
@@ -150,8 +150,6 @@ static void __init pq2fads_setup_arch(void)
/* Enable external IRQs */
clrbits32(_immr->im_siu_conf.siu_82xx.sc_siumcr, 0x0c00);
 
-   pq2_init_pci();
-
if (ppc_md.progress)
ppc_md.progress("pq2fads_setup_arch(), finish", 0);
 }
@@ -184,6 +182,7 @@ define_machine(pq2fads)
.name = "Freescale PQ2FADS",
.probe = pq2fads_probe,
.setup_arch = pq2fads_setup_arch,
+   .discover_phbs = pq2_init_pci,
.init_IRQ = pq2fads_pic_init,
.get_irq = cpm2_get_irq,
.calibrate_decr = generic_calibrate_decr,
-- 
2.26.2



  1   2   3   4   5   6   7   8   9   >