Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-11 Thread Rob Herring
On Tue, Dec 10, 2019 at 2:17 AM Frank Rowand  wrote:
>
> On 12/9/19 7:51 PM, Rob Herring wrote:
> > On Mon, Dec 9, 2019 at 7:35 AM Sebastian Andrzej Siewior
> >  wrote:
> >>
> >> On 2019-12-05 20:01:41 [-0600], Frank Rowand wrote:
> >>> Is there a memory usage issue for the systems that led to this thread?
> >>
> >> No, no memory issue led to this thread. I was just testing my patch and
> >> I assumed that I did something wrong in the counting/lock drop/lock
> >> acquire/allocate path because the array was hardly used. So I started to
> >> look deeper…
> >> Once I figured out everything was fine, I was curious if everyone is
> >> aware of the different phandle creation by dtc vs POWER. And I posted
> >> the mail in the thread.
> >> Once you confirmed that everything is "known / not an issue" I was ready
> >> to take off [0].
> >>
>> Later more replies came in, such as one mail [1] from Rob describing the
>> original reason with 814 phandles. _Here_ I was just surprised that 1024
>> entries were used instead of 64 for a benefit of 60ms. I understand that
>> this is of low concern for you because that memory is released if modules
>> are not enabled. I usually see that module support is left enabled.
> >>
> >> However, Rob suggested / asked about the fixed size array (this is how I
> >> understood it):
> >> |And yes, as mentioned earlier I don't like the complexity. I didn't
>> |from the start and I'm still of the opinion we should have a
> >> |fixed or 1 time sized true cache (i.e. smaller than total # of
> >> |phandles). That would solve the RT memory allocation and locking issue
> >> |too.
> >>
> >> so I attempted to ask if we should have the fixed size array maybe
>> with hash_32() instead of the mask. This would make my other patch
> >> obsolete because the fixed size array should not have a RT issue. The
> >> hash_32() part here would address the POWER issue where the cache is
> >> currently not used efficiently.
> >>
> >> If you want instead to keep things as-is then this is okay from my side.
> >> If you want to keep this cache off on POWER then I could contribute a
> >> patch doing so.
> >
> > It turns out there's actually a bug in the current implementation. If
> > we have multiple phandles with the same mask, then we leak node
> > references if we miss in the cache and re-assign the cache entry.
>
> Aaargh.  Patch sent.
>
> > Easily fixed I suppose, but holding a ref count for a cached entry
> > seems wrong. That means we never have a ref count of 0 on any node
> > with a phandle.
>
> It will go to zero when the cache is freed.
>
> My memory is that we free the cache as part of removing an overlay.  I'll
> verify whether my memory is correct.

Yes, as part of having entries for every phandle, we release and
realloc when the number of phandles changes. If the size is fixed, then
we can stop doing that. We only need to remove entries in
of_detach_node(), as that should always happen before nodes are
removed, right?
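
A minimal sketch of that invalidation, assuming a fixed-size,
hash_32()-indexed array (the array name, PHC_BITS, and the helper are
illustrative, not the actual kernel code):

/* sketch: drop the single matching entry when a node is detached */
static void phandle_cache_inv_entry(phandle handle)
{
        u32 slot;
        struct device_node *np;

        if (!handle)
                return;

        slot = hash_32(handle, PHC_BITS);
        np = phandle_cache[slot];
        if (np && np->phandle == handle)
                phandle_cache[slot] = NULL;
}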

Rob


Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-10 Thread Frank Rowand
On 12/10/19 2:17 AM, Frank Rowand wrote:
> On 12/9/19 7:51 PM, Rob Herring wrote:
>> On Mon, Dec 9, 2019 at 7:35 AM Sebastian Andrzej Siewior
>>  wrote:
>>>
>>> On 2019-12-05 20:01:41 [-0600], Frank Rowand wrote:
 Is there a memory usage issue for the systems that led to this thread?
>>>
>>> No, no memory issue led to this thread. I was just testing my patch and
>>> I assumed that I did something wrong in the counting/lock drop/lock
>>> acquire/allocate path because the array was hardly used. So I started to
>>> look deeper…
>>> Once I figured out everything was fine, I was curious if everyone is
>>> aware of the different phandle creation by dtc vs POWER. And I posted
>>> the mail in the thread.
>>> Once you confirmed that everything is "known / not an issue" I was ready
>>> to take off [0].
>>>
>>> Later more replies came in, such as one mail [1] from Rob describing the
>>> original reason with 814 phandles. _Here_ I was just surprised that 1024
>>> entries were used instead of 64 for a benefit of 60ms. I understand that
>>> this is of low concern for you because that memory is released if modules
>>> are not enabled. I usually see that module support is left enabled.
>>>
>>> However, Rob suggested / asked about the fixed size array (this is how I
>>> understood it):
>>> |And yes, as mentioned earlier I don't like the complexity. I didn't
>>> |from the start and I'm still of the opinion we should have a
>>> |fixed or 1 time sized true cache (i.e. smaller than total # of
>>> |phandles). That would solve the RT memory allocation and locking issue
>>> |too.
>>>
>>> so I attempted to ask if we should have the fixed size array maybe
>>> with hash_32() instead of the mask. This would make my other patch
>>> obsolete because the fixed size array should not have a RT issue. The
>>> hash_32() part here would address the POWER issue where the cache is
>>> currently not used efficiently.
>>>
>>> If you want instead to keep things as-is then this is okay from my side.
>>> If you want to keep this cache off on POWER then I could contribute a
>>> patch doing so.
>>
>> It turns out there's actually a bug in the current implementation. If
>> we have multiple phandles with the same mask, then we leak node
>> references if we miss in the cache and re-assign the cache entry.
> 
> Aaargh.  Patch sent.
> 
>> Easily fixed I suppose, but holding a ref count for a cached entry
>> seems wrong. That means we never have a ref count of 0 on any node
>> with a phandle.
> 
> It will go to zero when the cache is freed.
> 
> My memory is that we free the cache as part of removing an overlay.  I'll
> verify whether my memory is correct.

And I'll look at non-overlay use of dynamic devicetree too.

-Frank

> 
> -Frank
> 
> 
>>
>> I've done some more experiments with the performance. I've come to the
>> conclusion that just measuring boot time is not a good way at least if
>> the time is not a significant percentage of the total. I couldn't get
>> any measurable data. I'm using a RK3399 Rock960 board. It has 190
>> phandles. I get about 1500 calls to of_find_node_by_phandle() during
>> boot. Note that about the first 300 are before we have any timekeeping
>> (the prior measurements also would not account for this). Those calls
>> have no cache in the current implementation and are cached in my
>> implementation.
>>
>> no cache:  20124 us
>> current cache: 819 us
>>
>> new cache (64 entry): 4342 us
>> new cache (128 entry): 2875 us
>> new cache (256 entry): 1235 us
>>
>> To get some idea of the time spent before timekeeping is up: if we
>> take the avg miss time as ~17us (20124/1200), then we're spending
>> about ~5ms on lookups before the cache is enabled. I'd estimate the
>> new cache takes ~400us before timekeeping is up, as there are 11
>> misses early.
>>
>> From these numbers, it seems the miss rate has a significant impact on
>> performance for the different sizes. But compared to the original 20+
>> ms, they all look like a good improvement.
>>
>> Rob
>>
> 
> 



Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-10 Thread Frank Rowand
On 12/9/19 7:51 PM, Rob Herring wrote:
> On Mon, Dec 9, 2019 at 7:35 AM Sebastian Andrzej Siewior
>  wrote:
>>
>> On 2019-12-05 20:01:41 [-0600], Frank Rowand wrote:
>>> Is there a memory usage issue for the systems that led to this thread?
>>
>> No, no memory issue led to this thread. I was just testing my patch and
>> I assumed that I did something wrong in the counting/lock drop/lock
>> acquire/allocate path because the array was hardly used. So I started to
>> look deeper…
>> Once I figured out everything was fine, I was curious if everyone is
>> aware of the different phandle creation by dtc vs POWER. And I posted
>> the mail in the thread.
>> Once you confirmed that everything is "known / not an issue" I was ready
>> to take off [0].
>>
>> Later more replies came in, such as one mail [1] from Rob describing the
>> original reason with 814 phandles. _Here_ I was just surprised that 1024
>> entries were used instead of 64 for a benefit of 60ms. I understand that
>> this is of low concern for you because that memory is released if modules
>> are not enabled. I usually see that module support is left enabled.
>>
>> However, Rob suggested / asked about the fixed size array (this is how I
>> understood it):
>> |And yes, as mentioned earlier I don't like the complexity. I didn't
>> |from the start and I'm still of the opinion we should have a
>> |fixed or 1 time sized true cache (i.e. smaller than total # of
>> |phandles). That would solve the RT memory allocation and locking issue
>> |too.
>>
>> so I attempted to ask if we should have the fixed size array maybe
>> with hash_32() instead of the mask. This would make my other patch
>> obsolete because the fixed size array should not have a RT issue. The
>> hash_32() part here would address the POWER issue where the cache is
>> currently not used efficiently.
>>
>> If you want instead to keep things as-is then this is okay from my side.
>> If you want to keep this cache off on POWER then I could contribute a
>> patch doing so.
> 
> It turns out there's actually a bug in the current implementation. If
> we have multiple phandles with the same mask, then we leak node
> references if we miss in the cache and re-assign the cache entry.

Aaargh.  Patch sent.

> Easily fixed I suppose, but holding a ref count for a cached entry
> seems wrong. That means we never have a ref count of 0 on any node
> with a phandle.

It will go to zero when the cache is freed.

My memory is that we free the cache as part of removing an overlay.  I'll
verify whether my memory is correct.

-Frank


> 
> I've done some more experiments with the performance. I've come to the
> conclusion that just measuring boot time is not a good way at least if
> the time is not a significant percentage of the total. I couldn't get
> any measurable data. I'm using a RK3399 Rock960 board. It has 190
> phandles. I get about 1500 calls to of_find_node_by_phandle() during
> boot. Note that about the first 300 are before we have any timekeeping
> (the prior measurements also would not account for this). Those calls
> have no cache in the current implementation and are cached in my
> implementation.
> 
> no cache:  20124 us
> current cache: 819 us
> 
> new cache (64 entry): 4342 us
> new cache (128 entry): 2875 us
> new cache (256 entry): 1235 us
> 
> To get some idea of the time spent before timekeeping is up: if we
> take the avg miss time as ~17us (20124/1200), then we're spending
> about ~5ms on lookups before the cache is enabled. I'd estimate the
> new cache takes ~400us before timekeeping is up, as there are 11
> misses early.
> 
> From these numbers, it seems the miss rate has a significant impact on
> performance for the different sizes. But compared to the original 20+
> ms, they all look like a good improvement.
> 
> Rob
> 



Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-09 Thread Rob Herring
On Mon, Dec 9, 2019 at 7:35 AM Sebastian Andrzej Siewior
 wrote:
>
> On 2019-12-05 20:01:41 [-0600], Frank Rowand wrote:
> > Is there a memory usage issue for the systems that led to this thread?
>
> No, no memory issue led to this thread. I was just testing my patch and
> I assumed that I did something wrong in the counting/lock drop/lock
> acquire/allocate path because the array was hardly used. So I started to
> look deeper…
> Once I figured out everything was fine, I was curious if everyone is
> aware of the different phandle creation by dtc vs POWER. And I posted
> the mail in the thread.
> Once you confirmed that everything is "known / not an issue" I was ready
> to take off [0].
>
> Later more replies came in, such as one mail [1] from Rob describing the
> original reason with 814 phandles. _Here_ I was just surprised that 1024
> entries were used instead of 64 for a benefit of 60ms. I understand that
> this is of low concern for you because that memory is released if modules
> are not enabled. I usually see that module support is left enabled.
>
> However, Rob suggested / asked about the fixed size array (this is how I
> understood it):
> |And yes, as mentioned earlier I don't like the complexity. I didn't
> |from the start and I'm still of the opinion we should have a
> |fixed or 1 time sized true cache (i.e. smaller than total # of
> |phandles). That would solve the RT memory allocation and locking issue
> |too.
>
> so I attempted to ask if we should have the fixed size array maybe
> with hash_32() instead of the mask. This would make my other patch
> obsolete because the fixed size array should not have a RT issue. The
> hash_32() part here would address the POWER issue where the cache is
> currently not used efficiently.
>
> If you want instead to keep things as-is then this is okay from my side.
> If you want to keep this cache off on POWER then I could contribute a
> patch doing so.

It turns out there's actually a bug in the current implementation. If
we have multiple phandles with the same mask, then we leak node
references if we miss in the cache and re-assign the cache entry.
Easily fixed I suppose, but holding a ref count for a cached entry
seems wrong. That means we never have a ref count of 0 on any node
with a phandle.
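
A minimal sketch of the leak and the obvious fix (the array, mask, and
miss-path helper are illustrative stand-ins, not the actual kernel
code; of_node_get()/of_node_put() are the real refcount helpers):

/* sketch: put the reference held for the evicted entry on a miss */
static struct device_node *cached_lookup(phandle handle, u32 mask)
{
        u32 slot = handle & mask;
        struct device_node *np = phandle_cache[slot];

        if (np && np->phandle == handle)
                return of_node_get(np);         /* cache hit */

        np = search_whole_tree(handle);         /* cache miss */
        if (np) {
                if (phandle_cache[slot])
                        of_node_put(phandle_cache[slot]); /* the missing put */
                phandle_cache[slot] = of_node_get(np);
        }
        return np;
}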

I've done some more experiments with the performance. I've come to the
conclusion that just measuring boot time is not a good way at least if
the time is not a significant percentage of the total. I couldn't get
any measurable data. I'm using a RK3399 Rock960 board. It has 190
phandles. I get about 1500 calls to of_find_node_by_phandle() during
boot. Note that about the first 300 are before we have any timekeeping
(the prior measurements also would not account for this). Those calls
have no cache in the current implementation and are cached in my
implementation.

no cache:  20124 us
current cache: 819 us

new cache (64 entry): 4342 us
new cache (128 entry): 2875 us
new cache (256 entry): 1235 us

To get some idea of the time spent before timekeeping is up: if we
take the avg miss time as ~17us (20124/1200), then we're spending
about ~5ms on lookups before the cache is enabled. I'd estimate the
new cache takes ~400us before timekeeping is up, as there are 11
misses early.

From these numbers, it seems the miss rate has a significant impact on
performance for the different sizes. But compared to the original 20+
ms, they all look like a good improvement.

Rob


Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-09 Thread Sebastian Andrzej Siewior
On 2019-12-05 20:01:41 [-0600], Frank Rowand wrote:
> Is there a memory usage issue for the systems that led to this thread?

No, no memory issue led to this thread. I was just testing my patch and
I assumed that I did something wrong in the counting/lock drop/lock
acquire/allocate path because the array was hardly used. So I started to
look deeper…
Once I figured out everything was fine, I was curious if everyone is
aware of the different phandle creation by dtc vs POWER. And I posted
the mail in the thread.
Once you confirmed that everything is "known / not an issue" I was ready
to take off [0].

Later more replies came in, such as one mail [1] from Rob describing the
original reason with 814 phandles. _Here_ I was just surprised that 1024
entries were used instead of 64 for a benefit of 60ms. I understand that
this is of low concern for you because that memory is released if modules
are not enabled. I usually see that module support is left enabled.

However, Rob suggested / asked about the fixed size array (this is how I
understood it):
|And yes, as mentioned earlier I don't like the complexity. I didn't
|from the start and I'm still of the opinion we should have a
|fixed or 1 time sized true cache (i.e. smaller than total # of
|phandles). That would solve the RT memory allocation and locking issue
|too.

so I attempted to ask if we should have the fixed size array maybe
with hash_32() instead of the mask. This would make my other patch
obsolete because the fixed size array should not have a RT issue. The
hash_32() part here would address the POWER issue where the cache is
currently not used efficiently.
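
As a sketch of the two indexing schemes being compared (a 64-entry
cache assumed; hash_32() is the real helper from linux/hash.h, the
macro and function names here are illustrative):

#include <linux/hash.h>
#include <linux/types.h>

#define PHC_BITS 6
#define PHC_SIZE (1 << PHC_BITS)        /* 64-entry cache */

/* current scheme: low bits only -- perfect for dtc's contiguous 1..n
 * phandles, poor for SLOF's pointer-like values sharing low bits */
static inline u32 phc_slot_mask(phandle ph)
{
        return ph & (PHC_SIZE - 1);
}

/* proposed scheme: hash_32() mixes in the high bits as well */
static inline u32 phc_slot_hash(phandle ph)
{
        return hash_32(ph, PHC_BITS);
}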

If you want instead to keep things as-is then this is okay from my side.
If you want to keep this cache off on POWER then I could contribute a
patch doing so.

[0] 
https://lore.kernel.org/linux-devicetree/20191202110732.4dvzrro5o6zrl...@linutronix.de/
[1] 
https://lore.kernel.org/linux-devicetree/CAL_JsqKieG5=teL7gABPKbJOQfvoS9s-ZPF-=r0yee_luoy...@mail.gmail.com/
> -Frank

Sebastian


Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-07 Thread Frank Rowand
On 12/5/19 7:52 PM, Frank Rowand wrote:
> On 12/3/19 10:56 AM, Rob Herring wrote:
>> On Mon, Dec 2, 2019 at 10:28 PM Frank Rowand  wrote:
>>>
>>> On 12/2/19 10:12 PM, Michael Ellerman wrote:
 Frank Rowand  writes:
> On 11/29/19 9:10 AM, Sebastian Andrzej Siewior wrote:
>> I've been looking at phandle_cache and noticed the following: The raw
>> phandle value as generated by dtc starts at zero and is incremented by
>> one for each phandle entry. The qemu pSeries model is using Slof (which
>> is probably the same thing as used on real hardware) and this looks like
>> a pointer value for the phandle.
>> With
>> qemu-system-ppc64le -m 16G -machine pseries -smp 8
>>
>> I got the following output:
>> | entries: 64
>> | phandle 7e732468 slot 28 hash c
>> | phandle 7e732ad0 slot 10 hash 27
>> | phandle 7e732ee8 slot 28 hash 3a
>> | phandle 7e734160 slot 20 hash 36
>> | phandle 7e734318 slot 18 hash 3a
>> | phandle 7e734428 slot 28 hash 33
>> | phandle 7e734538 slot 38 hash 2c
>> | phandle 7e734850 slot 10 hash e
>> | phandle 7e735220 slot 20 hash 2d
>> | phandle 7e735bf0 slot 30 hash d
>> | phandle 7e7365c0 slot 0 hash 2d
>> | phandle 7e736f90 slot 10 hash d
>> | phandle 7e737960 slot 20 hash 2d
>> | phandle 7e738330 slot 30 hash d
>> | phandle 7e738d00 slot 0 hash 2d
>> | phandle 7e739730 slot 30 hash 38
>> | phandle 7e73bd08 slot 8 hash 17
>> | phandle 7e73c2e0 slot 20 hash 32
>> | phandle 7e73c7f8 slot 38 hash 37
>> | phandle 7e782420 slot 20 hash 13
>> | phandle 7e782ed8 slot 18 hash 1b
>> | phandle 7e73ce28 slot 28 hash 39
>> | phandle 7e73d390 slot 10 hash 22
>> | phandle 7e73d9a8 slot 28 hash 1a
>> | phandle 7e73dc28 slot 28 hash 37
>> | phandle 7e73de00 slot 0 hash a
>> | phandle 7e73e028 slot 28 hash 0
>> | phandle 7e7621a8 slot 28 hash 36
>> | phandle 7e73e458 slot 18 hash 1e
>> | phandle 7e73e608 slot 8 hash 1e
>> | phandle 7e740078 slot 38 hash 28
>> | phandle 7e740180 slot 0 hash 1d
>> | phandle 7e740240 slot 0 hash 33
>> | phandle 7e740348 slot 8 hash 29
>> | phandle 7e740410 slot 10 hash 2
>> | phandle 7e740eb0 slot 30 hash 3e
>> | phandle 7e745390 slot 10 hash 33
>> | phandle 7e747b08 slot 8 hash c
>> | phandle 7e748528 slot 28 hash f
>> | phandle 7e74a6e0 slot 20 hash 18
>> | phandle 7e74aab0 slot 30 hash b
>> | phandle 7e74f788 slot 8 hash d
>> | Used entries: 8, hashed: 29
>>
>> So the hash array has 64 entries, out of which only 8 are populated. Using
>> hash_32() populates 29 entries.
>> Could someone with real hardware verify this?
>> I'm not sure how important this is performance-wise; it just looks like a
>> waste using only 1/8 of the array.
>
> The hash used is based on the assumptions you noted, and as stated in the
> code, that phandle property values are in a contiguous range of 1..n
> (not starting from zero), which is what dtc generates.

> We knew that for systems that do not match the assumptions that the hash
> will not be optimal.

 If we're going to have the phandle cache it should at least make some
 attempt to work across the systems that Linux supports.

> Unless there is a serious performance problem for
> such systems, I do not want to make the phandle hash code more complicated
> to optimize for these cases.  And the pseries have been performing ok
> without phandle related performance issues that I remember hearing since
> before the cache was added, which could have only helped the performance.
> Yes, if your observations are correct, some memory is being wasted, but
> a 64 entry cache is not very large on a pseries.

 A single line change to use an actual hash function is hardly
 complicating it, compared to the amount of code already there.
>>>
>>> With a dtc generated devicetree, the hash is perfect, with no
>>> misses.  That single line change then makes the hash bad for
>>> dtc generated devicetrees.
>>
>> To quantify that, I did some tests with the hash algo and it's about a
>> 10% collision rate using phandle values of 1-$cache_size. There's
>> never more than 2 phandles colliding in a slot. The actual impact
>> would be less given 100% thrashing seems unlikely.

I also tested a few cache sizes.

For cache size 64 entries, I get a 14% collision rate (and also 14% of
entries unused).  For 1024 entries, I get a 12% collision rate.  And
I can confirm no more than two phandles hashing to the same slot.

So essentially confirming your data.

-Frank

> 
> Thank you for doing the tests.  Actual data helps a lot.
> 
> If there is only a 10% collision rate for this case, that does not
> sound bad to me.  There is the possibility of current or future
> code resulting in ping ponging between two phandle values which
> collide in the cache, but that should not be the common case.

Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-07 Thread Frank Rowand
On 12/6/19 5:40 PM, Segher Boessenkool wrote:
> Hi,
> 
> On Thu, Dec 05, 2019 at 07:37:24PM -0600, Frank Rowand wrote:
>> On 12/3/19 12:35 PM, Segher Boessenkool wrote:
>>> Btw.  Some OFs mangle the phandles some way, to make it easier to catch
>>> people using it as an address (and similarly, mangle ihandles differently,
>>> so you catch confusion between ihandles and phandles as well).  Like a
>>> simple xor, with some odd number preferably.  You should assume *nothing*
>>> about phandles, they are opaque identifiers.
>>
>> For arm32 machines that use dtc to generate the devicetree, which is a
>> very large user base, we certainly can make assumptions about phandles.
> 
> I was talking about OF.  Phandles are explicitly defined to be opaque
> tokens.  If there is an extra meaning to them in flattened device trees,
> well, the kernel should then only depend on that there, not for more
> general phandles.  Where is this documented btw?

And dtc generated devicetrees are a huge proportion of the OF systems.

It is not documented.

As an aside, overlays also depend upon the current dtc implementation.
If an extremely large value is used for a phandle then overlay
application will fail.


> 
>> Especially because the complaints about the overhead of phandle based
>> lookups have been voiced by users of this specific set of machines.
>>
>> For systems with a devicetree that does not follow the assumptions, the
>> phandle cache should not measurably increase the overhead of phandle
>> based lookups.
> 
> It's an extra memory access and extra code to execute, for not much gain
> (if anything).  While with a reasonable hash function it will be good
> for everyone.
> 
>> If you have measurements of a system where implementing the phandle
>> cache increased the overhead,
> 
> Are you seriously saying you think this code can run in zero time and
> space on most systems?

No.  I made no such claim.  Note the additional words in the following
sentences.


>> and the additional overhead is a concern
>> (such as significantly increasing boot time) then please share that
>> information with us.  Otherwise this is just a theoretical exercise.
> 
> The point is that this code could be easily beneficial for most (or all)
> users, not just those that use dtc-constructed device trees.  It is

The point is that the cache was implemented to solve a specific problem
for certain specific systems.  There had been a few reports of various
machines having the same issue, but finally someone measured a **significant**
improvement in boot time for a specific machine.  The boot time with
the cache was **measured** to be much shorter.  The boot time for all
systems with a dtc generated devicetree is expected to be faster.  No
one responded to the implementation when it was proposed with a
**measurement** that showed increased boot time.  A concern of using
more memory was raised and discussed, with at least one feature added
as a result (freeing the cache in late init if modules are not
enabled).

Being "beneficial for most (or all) users" has to be balanced against
whether the change would remove the benefit for the system that the
feature was originally implemented to solve a problem for.  There
was no performance data supplied to answer this question.  (Though
eventually Rob did some measurements of the impact on hash efficiency
for such a system.)


> completely obvious that having a worse cache hash function results in
> many more lookups.  Whether that results in something expressed as
> milliseconds on tiny systems or microseconds on bigger systems is
> completely beside the point.

There was no performance data accompanying the proposed change that
started this thread.  There was no data showing whether the systems
that this feature was created for would suffer.  There was no data
showing that the boot time of the pseries systems would improve.
There was no assertion made that too much memory was being used
by the cache (there was an implied assertion that a large
percentage of the memory used for the cache was not used, thus
the performance benefit of the cache could be improved by
changing from a mask to a hash).  We had rejected
creating a cache for several years until finally some solid
data was provided showing an actual need for it.

It is not a question of "milliseconds on tiny systems or
microseconds on bigger systems".  I agree with that.  But it
does matter whether the performance impact of various
implementations is large enough to either solve a problem
or create a problem.  On the other hand, if the amount of
memory used by the cache is a problem (which is _not_
what was asserted by the patch submitter) then we can have
a conversation about how to resolve that.

-Frank

> 
> 
> Segher
> 



Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-06 Thread Segher Boessenkool
Hi,

On Thu, Dec 05, 2019 at 07:37:24PM -0600, Frank Rowand wrote:
> On 12/3/19 12:35 PM, Segher Boessenkool wrote:
> > Btw.  Some OFs mangle the phandles some way, to make it easier to catch
> > people using it as an address (and similarly, mangle ihandles differently,
> > so you catch confusion between ihandles and phandles as well).  Like a
> > simple xor, with some odd number preferably.  You should assume *nothing*
> > about phandles, they are opaque identifiers.
> 
> For arm32 machines that use dtc to generate the devicetree, which is a
> very large user base, we certainly can make assumptions about phandles.

I was talking about OF.  Phandles are explicitly defined to be opaque
tokens.  If there is an extra meaning to them in flattened device trees,
well, the kernel should then only depend on that there, not for more
general phandles.  Where is this documented btw?

> Especially because the complaints about the overhead of phandle based
> lookups have been voiced by users of this specific set of machines.
> 
> For systems with a devicetree that does not follow the assumptions, the
> phandle cache should not measurably increase the overhead of phandle
> based lookups.

It's an extra memory access and extra code to execute, for not much gain
(if anything).  While with a reasonable hash function it will be good
for everyone.

> If you have measurements of a system where implementing the phandle
> cache increased the overhead,

Are you seriously saying you think this code can run in zero time and
space on most systems?

> and the additional overhead is a concern
> (such as significantly increasing boot time) then please share that
> information with us.  Otherwise this is just a theoretical exercise.

The point is that this code could be easily beneficial for most (or all)
users, not just those that use dtc-constructed device trees.  It is
completely obvious that having a worse cache hash function results in
many more lookups.  Whether that results in something expressed as
milliseconds on tiny systems or microseconds on bigger systems is
completely beside the point.


Segher


Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-05 Thread Frank Rowand
On 12/5/19 10:35 AM, Sebastian Andrzej Siewior wrote:
> On 2019-12-03 10:56:35 [-0600], Rob Herring wrote:
>>> Another possibility would be to make the cache be dependent
>>> upon not CONFIG_PPC.  It might be possible to disable the
>>> cache with a minimal code change.
>>
>> I'd rather not do that.
>>
>> And yes, as mentioned earlier I don't like the complexity. I didn't
>> from the start and I'm still of the opinion we should have a
>> fixed or 1 time sized true cache (i.e. smaller than total # of
>> phandles). That would solve the RT memory allocation and locking issue
>> too.
>>
>> For reference, the performance difference between the current
>> implementation (assuming fixes haven't regressed it) was ~400ms vs. a
>> ~340ms improvement with a 64 entry cache (using a mask, not a hash).
>> IMO, 340ms improvement was good enough.
> 
> Okay. So the 814 phandles would result in an array with 1024 slots. That
> would need 4KiB of memory.

Is this amount of memory an issue for this system?

If module support is not configured into the kernel then the cache is
removed and memory freed in a late initcall.  I don't know if that helps
your use case or not.


> What if we go back to the fixed 64-slot array but with hash_32() for the
> lookup? Without the hash we would be 60ms slower during boot (compared
> to now, based on ancient data), but the hash isn't expensive, so we
> end up with better coverage of the memory on systems which don't have a
> plain enumeration of the phandles.

That performance data is specific to one particular system.  It does not
generalize to all devicetree based systems.  So don't draw too many
conclusions from it.  If you want to understand the boot performance
impact for your system, you need to measure the alternatives on
your system.

Is there a memory usage issue for the systems that led to this thread?
Unless there is a documented memory issue, I do not want to expand an
issue with poor cache bucket percent utilization to the other issue
of cache size.

-Frank

> 
>> Rob
> 
> Sebastian
> 



Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-05 Thread Frank Rowand
On 12/3/19 10:56 AM, Rob Herring wrote:
> On Mon, Dec 2, 2019 at 10:28 PM Frank Rowand  wrote:
>>
>> On 12/2/19 10:12 PM, Michael Ellerman wrote:
>>> Frank Rowand  writes:
 On 11/29/19 9:10 AM, Sebastian Andrzej Siewior wrote:
> I've been looking at phandle_cache and noticed the following: The raw
> phandle value as generated by dtc starts at zero and is incremented by
> one for each phandle entry. The qemu pSeries model is using Slof (which
> is probably the same thing as used on real hardware) and this looks like
> a pointer value for the phandle.
> With
> qemu-system-ppc64le -m 16G -machine pseries -smp 8
>
> I got the following output:
> | entries: 64
> | phandle 7e732468 slot 28 hash c
> | phandle 7e732ad0 slot 10 hash 27
> | phandle 7e732ee8 slot 28 hash 3a
> | phandle 7e734160 slot 20 hash 36
> | phandle 7e734318 slot 18 hash 3a
> | phandle 7e734428 slot 28 hash 33
> | phandle 7e734538 slot 38 hash 2c
> | phandle 7e734850 slot 10 hash e
> | phandle 7e735220 slot 20 hash 2d
> | phandle 7e735bf0 slot 30 hash d
> | phandle 7e7365c0 slot 0 hash 2d
> | phandle 7e736f90 slot 10 hash d
> | phandle 7e737960 slot 20 hash 2d
> | phandle 7e738330 slot 30 hash d
> | phandle 7e738d00 slot 0 hash 2d
> | phandle 7e739730 slot 30 hash 38
> | phandle 7e73bd08 slot 8 hash 17
> | phandle 7e73c2e0 slot 20 hash 32
> | phandle 7e73c7f8 slot 38 hash 37
> | phandle 7e782420 slot 20 hash 13
> | phandle 7e782ed8 slot 18 hash 1b
> | phandle 7e73ce28 slot 28 hash 39
> | phandle 7e73d390 slot 10 hash 22
> | phandle 7e73d9a8 slot 28 hash 1a
> | phandle 7e73dc28 slot 28 hash 37
> | phandle 7e73de00 slot 0 hash a
> | phandle 7e73e028 slot 28 hash 0
> | phandle 7e7621a8 slot 28 hash 36
> | phandle 7e73e458 slot 18 hash 1e
> | phandle 7e73e608 slot 8 hash 1e
> | phandle 7e740078 slot 38 hash 28
> | phandle 7e740180 slot 0 hash 1d
> | phandle 7e740240 slot 0 hash 33
> | phandle 7e740348 slot 8 hash 29
> | phandle 7e740410 slot 10 hash 2
> | phandle 7e740eb0 slot 30 hash 3e
> | phandle 7e745390 slot 10 hash 33
> | phandle 7e747b08 slot 8 hash c
> | phandle 7e748528 slot 28 hash f
> | phandle 7e74a6e0 slot 20 hash 18
> | phandle 7e74aab0 slot 30 hash b
> | phandle 7e74f788 slot 8 hash d
> | Used entries: 8, hashed: 29
>
> So the hash array has 64 entries, out of which only 8 are populated. Using
> hash_32() populates 29 entries.
> Could someone with real hardware verify this?
> I'm not sure how important this is performance-wise; it just looks like a
> waste using only 1/8 of the array.

 The hash used is based on the assumptions you noted, and as stated in the
 code, that phandle property values are in a contiguous range of 1..n
 (not starting from zero), which is what dtc generates.
>>>
 We knew that for systems that do not match the assumptions that the hash
 will not be optimal.
>>>
>>> If we're going to have the phandle cache it should at least make some
>>> attempt to work across the systems that Linux supports.
>>>
 Unless there is a serious performance problem for
 such systems, I do not want to make the phandle hash code more complicated
 to optimize for these cases.  And the pseries have been performing ok
 without phandle related performance issues that I remember hearing since
 before the cache was added, which could have only helped the performance.
 Yes, if your observations are correct, some memory is being wasted, but
 a 64 entry cache is not very large on a pseries.
>>>
>>> A single line change to use an actual hash function is hardly
>>> complicating it, compared to the amount of code already there.
>>
>> With a dtc generated devicetree, the hash is perfect, with no
>> misses.  That single line change then makes the hash bad for
>> dtc generated devicetrees.
> 
> To quantify that, I did some tests with the hash algo and it's about a
> 10% collision rate using phandle values of 1-$cache_size. There's
> never more than 2 phandles colliding in a slot. The actual impact
> would be less given 100% thrashing seems unlikely.

Thank you for doing the tests.  Actual data helps a lot.

If there is only a 10% collision rate for this case, that does not
sound bad to me.  There is the possibility of current or future
code resulting in ping ponging between two phandle values which
collide in the cache, but that should not be the common case.

However, given a choice between two algorithms, one of which
guarantees no thrashing (the current cache algorithm) and one
which allows a pathologic use case which results in thrashing,
I prefer the first algorithm.  This may seem theoretical, but
I have seen enough pathological code paths in my years of
performance analysis and tuning to be sensitive to this issue.

> 
>> The cache was added to address a problem with a system with a dtc
>> generated devicetree.

Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-05 Thread Frank Rowand
On 12/3/19 12:35 PM, Segher Boessenkool wrote:
> Hi!
> 
> On Tue, Dec 03, 2019 at 03:03:22PM +1100, Michael Ellerman wrote:
>> Sebastian Andrzej Siewior  writes:
>> I've certainly heard it said that on some OF's the phandle was just ==
>> the address of the internal representation, and I guess maybe for SLOF
>> that is true.
> 
> It is (or was).  In many OFs it is just the effective address of some
> node structure.  SLOF runs with translation off normally.
> 
>> They seem to vary wildly though, eg. on an Apple G5:
> 
> Apple OF runs with translation on usually.  IIRC these are effective
> addresses as well.
> 
> The OF they have on G5 machines is mostly 32-bit, for compatibility is my
> guess (for userland things dealing with addresses from OF, importantly).
> 
>>   $ find /proc/device-tree/ -name phandle | xargs lsprop | head -10
>>   /proc/device-tree/vsp@0,f900/veo@f918/phandle ff970848
>>   /proc/device-tree/vsp@0,f900/phandle ff970360
>>   /proc/device-tree/vsp@0,f900/veo@f908/phandle ff970730
>>   /proc/device-tree/nvram@0,fff04000/phandle ff967fb8
>>   /proc/device-tree/xmodem/phandle ff9655e8
>>   /proc/device-tree/multiboot/phandle ff9504f0
>>   /proc/device-tree/diagnostics/phandle ff965550
>>   /proc/device-tree/options/phandle ff893cf0
>>   /proc/device-tree/openprom/client-services/phandle ff8925b8
>>   /proc/device-tree/openprom/phandle ff892458
>>
>> That machine does not have enough RAM for those to be 32-bit real
>> addresses. I think Apple OF is running in virtual mode though (?), so
>> maybe they are pointers?
> 
> Yes, I think the default is to have 8MB ram at the top of 4GB (which is
> the physical address of the bootrom, btw) for OF.
> 
>> And on an IBM pseries machine they're a bit all over the place:
>>
>>   /proc/device-tree/cpus/PowerPC,POWER8@40/ibm,phandle 1040
>>   /proc/device-tree/cpus/l2-cache@2005/ibm,phandle 2005
>>   /proc/device-tree/cpus/PowerPC,POWER8@30/ibm,phandle 1030
>>   /proc/device-tree/cpus/PowerPC,POWER8@20/ibm,phandle 1020
>>   /proc/device-tree/cpus/PowerPC,POWER8@10/ibm,phandle 1010
>>   /proc/device-tree/cpus/l2-cache@2003/ibm,phandle 2003
>>   /proc/device-tree/cpus/l2-cache@200a/ibm,phandle 200a
>>   /proc/device-tree/cpus/l3-cache@3108/ibm,phandle 3108
>>   /proc/device-tree/cpus/l2-cache@2001/ibm,phandle 2001
>>   /proc/device-tree/cpus/l3-cache@3106/ibm,phandle 3106
>>   /proc/device-tree/cpus/ibm,phandle fff8
>>   /proc/device-tree/cpus/l3-cache@3104/ibm,phandle 3104
>>   /proc/device-tree/cpus/l2-cache@2008/ibm,phandle 2008
>>   /proc/device-tree/cpus/l3-cache@3102/ibm,phandle 3102
>>   /proc/device-tree/cpus/l2-cache@2006/ibm,phandle 2006
>>   /proc/device-tree/cpus/l3-cache@3100/ibm,phandle 3100
>>   /proc/device-tree/cpus/PowerPC,POWER8@8/ibm,phandle 1008
>>   /proc/device-tree/cpus/l2-cache@2004/ibm,phandle 2004
>>   /proc/device-tree/cpus/PowerPC,POWER8@48/ibm,phandle 1048
>>   /proc/device-tree/cpus/PowerPC,POWER8@38/ibm,phandle 1038
>>   /proc/device-tree/cpus/l2-cache@2002/ibm,phandle 2002
>>   /proc/device-tree/cpus/PowerPC,POWER8@28/ibm,phandle 1028
>>   /proc/device-tree/cpus/l3-cache@3107/ibm,phandle 3107
>>   /proc/device-tree/cpus/PowerPC,POWER8@18/ibm,phandle 1018
>>   /proc/device-tree/cpus/l2-cache@2000/ibm,phandle 2000
>>   /proc/device-tree/cpus/l3-cache@3105/ibm,phandle 3105
>>   /proc/device-tree/cpus/l3-cache@3103/ibm,phandle 3103
>>   /proc/device-tree/cpus/l3-cache@310a/ibm,phandle 310a
>>   /proc/device-tree/cpus/PowerPC,POWER8@0/ibm,phandle 1000
>>   /proc/device-tree/cpus/l2-cache@2007/ibm,phandle 2007
>>   /proc/device-tree/cpus/l3-cache@3101/ibm,phandle 3101
>>   /proc/device-tree/pci@800201b/ibm,phandle 201b
> 
> Some (the 1000) look like addresses as well.
> 
>>> So the hash array has 64 entries, out of which only 8 are populated. Using
>>> hash_32() populates 29 entries.
> 
>> On the G5 it's similarly inefficient:
>> [0.007379] OF: of_populate_phandle_cache(242) Used entries: 31, hashed: 111
> 
>> And some output from a "real" pseries machine (IBM OF), which is
>> slightly better:
>> [0.129467] OF: of_populate_phandle_cache(242) Used entries: 39, hashed: 81
> 
>> So yeah using hash_32() is quite a bit better in both cases.
> 
> Yup, no surprise there.  And hash_32 is very cheap to compute.
> 
>> And if I'm reading your patch right it would be a single line change to
>> switch, so that seems like it's worth doing to me.
> 
> Agreed!
> 
> Btw.  Some OFs mangle the phandles some way, to make it easier to catch
> people using it as an address (and similarly, mangle ihandles differently,
> so you catch confusion between ihandles and phandles as well).  Like a
> simple xor, with some odd number preferably.  You should assume *nothing*
> about phandles, they are opaque identifiers.

For arm32 machines that use dtc to generate the devicetree, which is a
very large user base, we certainly can make assumptions about phandles.

Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-05 Thread Sebastian Andrzej Siewior
On 2019-12-03 10:56:35 [-0600], Rob Herring wrote:
> > Another possibility would be to make the cache be dependent
> > upon not CONFIG_PPC.  It might be possible to disable the
> > cache with a minimal code change.
> 
> I'd rather not do that.
> 
> And yes, as mentioned earlier I don't like the complexity. I didn't
> from the start and I'm still of the opinion we should have a
> fixed or 1 time sized true cache (i.e. smaller than total # of
> phandles). That would solve the RT memory allocation and locking issue
> too.
> 
> For reference, the performance difference between the current
> implementation (assuming fixes haven't regressed it) was ~400ms vs. a
> ~340ms improvement with a 64 entry cache (using a mask, not a hash).
> IMO, 340ms improvement was good enough.

Okay. So the 814 phandles would result in an array with 1024 slots. That
would need 4KiB of memory.
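
For illustration, the sizing that turns 814 phandles into a 1024-slot
array looks roughly like this (a sketch of the then-current dynamic
scheme; function and variable names are approximate):

/* sketch: size the cache to the next power of two of the phandle count */
static struct device_node **alloc_phandle_cache(u32 phandles, u32 *mask)
{
        u32 entries = roundup_pow_of_two(phandles);     /* 814 -> 1024 */

        *mask = entries - 1;
        return kcalloc(entries, sizeof(struct device_node *), GFP_ATOMIC);
}
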
What if we go back to the fixed 64-slot array but with hash_32() for the
lookup? Without the hash we would be 60ms slower during boot (compared
to now, based on ancient data), but the hash isn't expensive, so we
end up with better coverage of the memory on systems which don't have a
plain enumeration of the phandles.

> Rob

Sebastian


Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-03 Thread Segher Boessenkool
Hi!

On Tue, Dec 03, 2019 at 03:03:22PM +1100, Michael Ellerman wrote:
> Sebastian Andrzej Siewior  writes:
> I've certainly heard it said that on some OF's the phandle was just ==
> the address of the internal representation, and I guess maybe for SLOF
> that is true.

It is (or was).  In many OFs it is just the effective address of some
node structure.  SLOF runs with translation off normally.

> They seem to vary wildly though, eg. on an Apple G5:

Apple OF runs with translation on usually.  IIRC these are effective
addresses as well.

The OF they have on G5 machines is mostly 32-bit, for compatibility is my
guess (for userland things dealing with addresses from OF, importantly).

>   $ find /proc/device-tree/ -name phandle | xargs lsprop | head -10
>   /proc/device-tree/vsp@0,f900/veo@f918/phandle ff970848
>   /proc/device-tree/vsp@0,f900/phandle ff970360
>   /proc/device-tree/vsp@0,f900/veo@f908/phandle ff970730
>   /proc/device-tree/nvram@0,fff04000/phandle ff967fb8
>   /proc/device-tree/xmodem/phandle ff9655e8
>   /proc/device-tree/multiboot/phandle ff9504f0
>   /proc/device-tree/diagnostics/phandle ff965550
>   /proc/device-tree/options/phandle ff893cf0
>   /proc/device-tree/openprom/client-services/phandle ff8925b8
>   /proc/device-tree/openprom/phandle ff892458
> 
> That machine does not have enough RAM for those to be 32-bit real
> addresses. I think Apple OF is running in virtual mode though (?), so
> maybe they are pointers?

Yes, I think the default is to have 8MB ram at the top of 4GB (which is
the physical address of the bootrom, btw) for OF.

> And on an IBM pseries machine they're a bit all over the place:
> 
>   /proc/device-tree/cpus/PowerPC,POWER8@40/ibm,phandle 1040
>   /proc/device-tree/cpus/l2-cache@2005/ibm,phandle 2005
>   /proc/device-tree/cpus/PowerPC,POWER8@30/ibm,phandle 1030
>   /proc/device-tree/cpus/PowerPC,POWER8@20/ibm,phandle 1020
>   /proc/device-tree/cpus/PowerPC,POWER8@10/ibm,phandle 1010
>   /proc/device-tree/cpus/l2-cache@2003/ibm,phandle 2003
>   /proc/device-tree/cpus/l2-cache@200a/ibm,phandle 200a
>   /proc/device-tree/cpus/l3-cache@3108/ibm,phandle 3108
>   /proc/device-tree/cpus/l2-cache@2001/ibm,phandle 2001
>   /proc/device-tree/cpus/l3-cache@3106/ibm,phandle 3106
>   /proc/device-tree/cpus/ibm,phandle fff8
>   /proc/device-tree/cpus/l3-cache@3104/ibm,phandle 3104
>   /proc/device-tree/cpus/l2-cache@2008/ibm,phandle 2008
>   /proc/device-tree/cpus/l3-cache@3102/ibm,phandle 3102
>   /proc/device-tree/cpus/l2-cache@2006/ibm,phandle 2006
>   /proc/device-tree/cpus/l3-cache@3100/ibm,phandle 3100
>   /proc/device-tree/cpus/PowerPC,POWER8@8/ibm,phandle 1008
>   /proc/device-tree/cpus/l2-cache@2004/ibm,phandle 2004
>   /proc/device-tree/cpus/PowerPC,POWER8@48/ibm,phandle 1048
>   /proc/device-tree/cpus/PowerPC,POWER8@38/ibm,phandle 1038
>   /proc/device-tree/cpus/l2-cache@2002/ibm,phandle 2002
>   /proc/device-tree/cpus/PowerPC,POWER8@28/ibm,phandle 1028
>   /proc/device-tree/cpus/l3-cache@3107/ibm,phandle 3107
>   /proc/device-tree/cpus/PowerPC,POWER8@18/ibm,phandle 1018
>   /proc/device-tree/cpus/l2-cache@2000/ibm,phandle 2000
>   /proc/device-tree/cpus/l3-cache@3105/ibm,phandle 3105
>   /proc/device-tree/cpus/l3-cache@3103/ibm,phandle 3103
>   /proc/device-tree/cpus/l3-cache@310a/ibm,phandle 310a
>   /proc/device-tree/cpus/PowerPC,POWER8@0/ibm,phandle 1000
>   /proc/device-tree/cpus/l2-cache@2007/ibm,phandle 2007
>   /proc/device-tree/cpus/l3-cache@3101/ibm,phandle 3101
>   /proc/device-tree/pci@800201b/ibm,phandle 201b

Some (the 1000) look like addresses as well.

> > So the hash array has 64 entries, out of which only 8 are populated. Using
> > hash_32() populates 29 entries.

> On the G5 it's similarly inefficient:
> [0.007379] OF: of_populate_phandle_cache(242) Used entries: 31, hashed: 111

> And some output from a "real" pseries machine (IBM OF), which is
> slightly better:
> [0.129467] OF: of_populate_phandle_cache(242) Used entries: 39, hashed: 81

> So yeah using hash_32() is quite a bit better in both cases.

Yup, no surprise there.  And hash_32 is very cheap to compute.

> And if I'm reading your patch right it would be a single line change to
> switch, so that seems like it's worth doing to me.

Agreed!

Btw.  Some OFs mangle the phandles some way, to make it easier to catch
people using it as an address (and similarly, mangle ihandles differently,
so you catch confusion between ihandles and phandles as well).  Like a
simple xor, with some odd number preferably.  You should assume *nothing*
about phandles, they are opaque identifiers.
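
Purely illustrative, that kind of mangling might look like the
following (the constant and function name are made up):

/* derive a phandle from a node's effective address, xor'ed with an
 * odd constant so it no longer works directly as an address */
static u32 make_phandle(void *node)
{
        return (u32)(unsigned long)node ^ 0x20040917;   /* odd number */
}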


Segher


Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-03 Thread Rob Herring
On Mon, Dec 2, 2019 at 10:28 PM Frank Rowand  wrote:
>
> On 12/2/19 10:12 PM, Michael Ellerman wrote:
> > Frank Rowand  writes:
> >> On 11/29/19 9:10 AM, Sebastian Andrzej Siewior wrote:
> >>> I've been looking at phandle_cache and noticed the following: The raw
> >>> phandle value as generated by dtc starts at zero and is incremented by
> >>> one for each phandle entry. The qemu pSeries model is using Slof (which
> >>> is probably the same thing as used on real hardware) and this looks like
> >>> a pointer value for the phandle.
> >>> With
> >>> qemu-system-ppc64le -m 16G -machine pseries -smp 8
> >>>
> >>> I got the following output:
> >>> | entries: 64
> >>> | phandle 7e732468 slot 28 hash c
> >>> | phandle 7e732ad0 slot 10 hash 27
> >>> | phandle 7e732ee8 slot 28 hash 3a
> >>> | phandle 7e734160 slot 20 hash 36
> >>> | phandle 7e734318 slot 18 hash 3a
> >>> | phandle 7e734428 slot 28 hash 33
> >>> | phandle 7e734538 slot 38 hash 2c
> >>> | phandle 7e734850 slot 10 hash e
> >>> | phandle 7e735220 slot 20 hash 2d
> >>> | phandle 7e735bf0 slot 30 hash d
> >>> | phandle 7e7365c0 slot 0 hash 2d
> >>> | phandle 7e736f90 slot 10 hash d
> >>> | phandle 7e737960 slot 20 hash 2d
> >>> | phandle 7e738330 slot 30 hash d
> >>> | phandle 7e738d00 slot 0 hash 2d
> >>> | phandle 7e739730 slot 30 hash 38
> >>> | phandle 7e73bd08 slot 8 hash 17
> >>> | phandle 7e73c2e0 slot 20 hash 32
> >>> | phandle 7e73c7f8 slot 38 hash 37
> >>> | phandle 7e782420 slot 20 hash 13
> >>> | phandle 7e782ed8 slot 18 hash 1b
> >>> | phandle 7e73ce28 slot 28 hash 39
> >>> | phandle 7e73d390 slot 10 hash 22
> >>> | phandle 7e73d9a8 slot 28 hash 1a
> >>> | phandle 7e73dc28 slot 28 hash 37
> >>> | phandle 7e73de00 slot 0 hash a
> >>> | phandle 7e73e028 slot 28 hash 0
> >>> | phandle 7e7621a8 slot 28 hash 36
> >>> | phandle 7e73e458 slot 18 hash 1e
> >>> | phandle 7e73e608 slot 8 hash 1e
> >>> | phandle 7e740078 slot 38 hash 28
> >>> | phandle 7e740180 slot 0 hash 1d
> >>> | phandle 7e740240 slot 0 hash 33
> >>> | phandle 7e740348 slot 8 hash 29
> >>> | phandle 7e740410 slot 10 hash 2
> >>> | phandle 7e740eb0 slot 30 hash 3e
> >>> | phandle 7e745390 slot 10 hash 33
> >>> | phandle 7e747b08 slot 8 hash c
> >>> | phandle 7e748528 slot 28 hash f
> >>> | phandle 7e74a6e0 slot 20 hash 18
> >>> | phandle 7e74aab0 slot 30 hash b
> >>> | phandle 7e74f788 slot 8 hash d
> >>> | Used entries: 8, hashed: 29
> >>>
> >>> So the hash array has 64 entries, out of which only 8 are populated. Using
> >>> hash_32() populates 29 entries.
> >>> Could someone with real hardware verify this?
> >>> I'm not sure how important this is performance-wise; it just looks like a
> >>> waste using only 1/8 of the array.
> >>
> >> The hash used is based on the assumptions you noted, and as stated in the
> >> code, that phandle property values are in a contiguous range of 1..n
> >> (not starting from zero), which is what dtc generates.
> >
> >> We knew that for systems that do not match the assumptions that the hash
> >> will not be optimal.
> >
> > If we're going to have the phandle cache it should at least make some
> > attempt to work across the systems that Linux supports.
> >
> >> Unless there is a serious performance problem for
> >> such systems, I do not want to make the phandle hash code more complicated
> >> to optimize for these cases.  And the pseries have been performing ok
> >> without phandle related performance issues that I remember hearing since
> >> before the cache was added, which could have only helped the performance.
> >> Yes, if your observations are correct, some memory is being wasted, but
> >> a 64 entry cache is not very large on a pseries.
> >
> > A single line change to use an actual hash function is hardly
> > complicating it, compared to the amount of code already there.
>
> With a dtc generated devicetree, the hash is perfect, with no
> misses.  That single line change then makes the hash bad for
> dtc generated devicetrees.

To quantify that, I did some tests with the hash algo and it's about a
10% collision rate using phandle values of 1-$cache_size. There's
never more than 2 phandles colliding in a slot. The actual impact
would be less given 100% thrashing seems unlikely.
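
A userspace sketch of that collision test (reimplementing the kernel's
hash_32() with its GOLDEN_RATIO_32 multiplier; a 64-entry cache is
assumed):

#include <stdio.h>
#include <stdint.h>

#define GOLDEN_RATIO_32 0x61C88647      /* multiplier used by hash_32() */

static uint32_t hash_32(uint32_t val, unsigned int bits)
{
        return (val * GOLDEN_RATIO_32) >> (32 - bits);
}

int main(void)
{
        enum { BITS = 6, N = 1 << BITS };       /* 64-entry cache */
        int count[N] = { 0 }, collisions = 0;

        for (uint32_t ph = 1; ph <= N; ph++)    /* dtc-style phandles 1..N */
                count[hash_32(ph, BITS)]++;
        for (int i = 0; i < N; i++)
                if (count[i] > 1)
                        collisions += count[i] - 1;
        printf("colliding phandles: %d of %d\n", collisions, N);
        return 0;
}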

> The cache was added to address a problem with a system with a
> dtc generated devicetree.

The cache was added to address a problem with a system with a large
number of nodes and phandles (814 phandles in the reported case). dtc
happens to be used in that case.

> I had not heard of any phandle
> lookup performance issues on ppc systems.  An imperfect hash
> for ppc should not make the ppc performance worse (unless
> every single phandle value hashes to a single bucket).  So
> the ppc performance is likely better than it was before
> the hash was added, even without an optimal hash algorithm.
>
> So the change would not be a single line change.  It would
> be a change to use different hash algorithms for dtc
> generated device trees versus other device trees.
>
> Another possibility would be to make the cache be dependent
> upon not CONFIG_PPC.  It might be possible to disable the
> cache with a minimal code change.

Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-02 Thread Frank Rowand
On 12/2/19 10:12 PM, Michael Ellerman wrote:
> Frank Rowand  writes:
>> On 11/29/19 9:10 AM, Sebastian Andrzej Siewior wrote:
>>> I've been looking at phandle_cache and noticed the following: The raw
>>> phandle value as generated by dtc starts at zero and is incremented by
>>> one for each phandle entry. The qemu pSeries model is using Slof (which
>>> is probably the same thing as used on real hardware) and this looks like
>>> a pointer value for the phandle.
>>> With
>>> qemu-system-ppc64le -m 16G -machine pseries -smp 8 
>>>
>>> I got the following output:
>>> | entries: 64
>>> | phandle 7e732468 slot 28 hash c
>>> | phandle 7e732ad0 slot 10 hash 27
>>> | phandle 7e732ee8 slot 28 hash 3a
>>> | phandle 7e734160 slot 20 hash 36
>>> | phandle 7e734318 slot 18 hash 3a
>>> | phandle 7e734428 slot 28 hash 33
>>> | phandle 7e734538 slot 38 hash 2c
>>> | phandle 7e734850 slot 10 hash e
>>> | phandle 7e735220 slot 20 hash 2d
>>> | phandle 7e735bf0 slot 30 hash d
>>> | phandle 7e7365c0 slot 0 hash 2d
>>> | phandle 7e736f90 slot 10 hash d
>>> | phandle 7e737960 slot 20 hash 2d
>>> | phandle 7e738330 slot 30 hash d
>>> | phandle 7e738d00 slot 0 hash 2d
>>> | phandle 7e739730 slot 30 hash 38
>>> | phandle 7e73bd08 slot 8 hash 17
>>> | phandle 7e73c2e0 slot 20 hash 32
>>> | phandle 7e73c7f8 slot 38 hash 37
>>> | phandle 7e782420 slot 20 hash 13
>>> | phandle 7e782ed8 slot 18 hash 1b
>>> | phandle 7e73ce28 slot 28 hash 39
>>> | phandle 7e73d390 slot 10 hash 22
>>> | phandle 7e73d9a8 slot 28 hash 1a
>>> | phandle 7e73dc28 slot 28 hash 37
>>> | phandle 7e73de00 slot 0 hash a
>>> | phandle 7e73e028 slot 28 hash 0
>>> | phandle 7e7621a8 slot 28 hash 36
>>> | phandle 7e73e458 slot 18 hash 1e
>>> | phandle 7e73e608 slot 8 hash 1e
>>> | phandle 7e740078 slot 38 hash 28
>>> | phandle 7e740180 slot 0 hash 1d
>>> | phandle 7e740240 slot 0 hash 33
>>> | phandle 7e740348 slot 8 hash 29
>>> | phandle 7e740410 slot 10 hash 2
>>> | phandle 7e740eb0 slot 30 hash 3e
>>> | phandle 7e745390 slot 10 hash 33
>>> | phandle 7e747b08 slot 8 hash c
>>> | phandle 7e748528 slot 28 hash f
>>> | phandle 7e74a6e0 slot 20 hash 18
>>> | phandle 7e74aab0 slot 30 hash b
>>> | phandle 7e74f788 slot 8 hash d
>>> | Used entries: 8, hashed: 29
>>>
>>> So the hash array has 64 entries, out of which only 8 are populated. Using
>>> hash_32() populates 29 entries.
>>> Could someone with real hardware verify this?
>>> I'm not sure how important this is performance-wise; it just looks like a
>>> waste using only 1/8 of the array.
>>
>> The hash used is based on the assumptions you noted, and as stated in the
>> code, that phandle property values are in a contiguous range of 1..n
>> (not starting from zero), which is what dtc generates.
> 
>> We knew that for systems that do not match the assumptions that the hash
>> will not be optimal.
> 
> If we're going to have the phandle cache it should at least make some
> attempt to work across the systems that Linux supports.
> 
>> Unless there is a serious performance problem for
>> such systems, I do not want to make the phandle hash code more complicated
>> to optimize for these cases.  And the pseries have been performing ok
>> without phandle related performance issues that I remember hearing since
>> before the cache was added, which could have only helped the performance.
>> Yes, if your observations are correct, some memory is being wasted, but
>> a 64 entry cache is not very large on a pseries.
> 
> A single line change to use an actual hash function is hardly
> complicating it, compared to the amount of code already there.

With a dtc generated devicetree, the hash is perfect, with no
misses.  That single line change then makes the hash bad for
dtc generated devicetrees.

The cache was added to address a problem with a system with a
dtc generated devicetree.  I had not heard of any phandle
lookup performance issues on ppc systems.  An imperfect hash
for ppc should not make the ppc performance worse (unless
every single phandle value hashes to a single bucket).  So
the ppc performance is likely better than it was before
the hash was added, even without an optimal hash algorithm.

So the change would not be a single line change.  It would
be a change to use different hash algorithms for dtc
generated device trees versus other device trees.

Another possibility would be to make the cache be dependent
upon not CONFIG_PPC.  It might be possible to disable the
cache with a minimal code change.
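
A minimal sketch of that option (of_populate_phandle_cache() is the
populate routine named in the boot logs above; the gating shown is
illustrative):

        /* sketch: skip building the cache entirely on PPC */
#ifndef CONFIG_PPC
        of_populate_phandle_cache();
#endif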

> 
> cheers
> 



Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-02 Thread Michael Ellerman
Frank Rowand  writes:
> On 11/29/19 9:10 AM, Sebastian Andrzej Siewior wrote:
>> I've been looking at phandle_cache and noticed the following: The raw
>> phandle value as generated by dtc starts at zero and is incremented by
>> one for each phandle entry. The qemu pSeries model is using Slof (which
>> is probably the same thing as used on real hardware) and this looks like
>> a pointer value for the phandle.
>> With
>>  qemu-system-ppc64le -m 16G -machine pseries -smp 8 
>> 
>> I got the following output:
>> | entries: 64
>> [... 42 phandle lines trimmed; the full dump is quoted in the
>> original message at the end of this thread ...]
>> | Used entries: 8, hashed: 29
>> 
>> So the hash array has 64 entries, of which only 8 are populated. Using
>> hash_32() populates 29 entries.
>> Could someone with real hardware verify this?
>> I'm not sure how important this is performance-wise, but it looks like a
>> waste to use only 1/8 of the array.
>
> The hash used is based on the assumptions you noted, and as stated in the
> code, that phandle property values are in a contiguous range of 1..n
> (not starting from zero), which is what dtc generates.

> We knew that the hash will not be optimal for systems that do not
> match those assumptions.

If we're going to have the phandle cache it should at least make some
attempt to work across the systems that Linux supports.

> Unless there is a serious performance problem for
> such systems, I do not want to make the phandle hash code more complicated
> to optimize for these cases.  And pseries systems have been performing OK:
> I do not remember hearing of any phandle-related performance issues since
> before the cache was added, and the cache can only have helped.
> Yes, if your observations are correct, some memory is being wasted, but
> a 64-entry cache is not very large on a pseries.

A single line change to use an actual hash function is hardly
complicating it, compared to the amount of code already there.

cheers
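
For concreteness, the "single line change" under discussion presumably
amounts to promoting the hash_32() experiment from Sebastian's test
patch (quoted in full at the end of this thread) to the real cache
index. A hypothetical, untested sketch, assuming cache_entries remains
a power of two so that __ffs(cache_entries) yields the number of index
bits; the read side in of_find_node_by_phandle() would need the matching
index computation:

--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ void of_populate_phandle_cache(void)
-			phandle_cache[np->phandle & phandle_cache_mask] = np;
+			phandle_cache[hash_32(np->phandle, __ffs(cache_entries))] = np;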


Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-02 Thread Michael Ellerman
Sebastian Andrzej Siewior  writes:
> I've been looking at phandle_cache and noticed the following: The raw
> phandle value as generated by dtc starts at zero and is incremented by
> one for each phandle entry. The qemu pSeries model is using SLOF (which
> is probably the same thing as used on real hardware) and there the
> phandle looks like a pointer value.

We don't use SLOF on bare metal these days.

I've certainly heard it said that on some OFs the phandle was just ==
the address of the internal representation, and I guess maybe for SLOF
that is true.

They seem to vary wildly though, eg. on an Apple G5:

  $ find /proc/device-tree/ -name phandle | xargs lsprop | head -10
  /proc/device-tree/vsp@0,f900/veo@f918/phandle ff970848
  /proc/device-tree/vsp@0,f900/phandle ff970360
  /proc/device-tree/vsp@0,f900/veo@f908/phandle ff970730
  /proc/device-tree/nvram@0,fff04000/phandle ff967fb8
  /proc/device-tree/xmodem/phandle ff9655e8
  /proc/device-tree/multiboot/phandle ff9504f0
  /proc/device-tree/diagnostics/phandle ff965550
  /proc/device-tree/options/phandle ff893cf0
  /proc/device-tree/openprom/client-services/phandle ff8925b8
  /proc/device-tree/openprom/phandle ff892458

That machine does not have enough RAM for those to be 32-bit real
addresses. I think Apple OF is running in virtual mode though (?), so
maybe they are pointers?

And on an IBM pseries machine they're a bit all over the place:

  /proc/device-tree/cpus/PowerPC,POWER8@40/ibm,phandle 1040
  /proc/device-tree/cpus/l2-cache@2005/ibm,phandle 2005
  /proc/device-tree/cpus/PowerPC,POWER8@30/ibm,phandle 1030
  /proc/device-tree/cpus/PowerPC,POWER8@20/ibm,phandle 1020
  /proc/device-tree/cpus/PowerPC,POWER8@10/ibm,phandle 1010
  /proc/device-tree/cpus/l2-cache@2003/ibm,phandle 2003
  /proc/device-tree/cpus/l2-cache@200a/ibm,phandle 200a
  /proc/device-tree/cpus/l3-cache@3108/ibm,phandle 3108
  /proc/device-tree/cpus/l2-cache@2001/ibm,phandle 2001
  /proc/device-tree/cpus/l3-cache@3106/ibm,phandle 3106
  /proc/device-tree/cpus/ibm,phandle fff8
  /proc/device-tree/cpus/l3-cache@3104/ibm,phandle 3104
  /proc/device-tree/cpus/l2-cache@2008/ibm,phandle 2008
  /proc/device-tree/cpus/l3-cache@3102/ibm,phandle 3102
  /proc/device-tree/cpus/l2-cache@2006/ibm,phandle 2006
  /proc/device-tree/cpus/l3-cache@3100/ibm,phandle 3100
  /proc/device-tree/cpus/PowerPC,POWER8@8/ibm,phandle 1008
  /proc/device-tree/cpus/l2-cache@2004/ibm,phandle 2004
  /proc/device-tree/cpus/PowerPC,POWER8@48/ibm,phandle 1048
  /proc/device-tree/cpus/PowerPC,POWER8@38/ibm,phandle 1038
  /proc/device-tree/cpus/l2-cache@2002/ibm,phandle 2002
  /proc/device-tree/cpus/PowerPC,POWER8@28/ibm,phandle 1028
  /proc/device-tree/cpus/l3-cache@3107/ibm,phandle 3107
  /proc/device-tree/cpus/PowerPC,POWER8@18/ibm,phandle 1018
  /proc/device-tree/cpus/l2-cache@2000/ibm,phandle 2000
  /proc/device-tree/cpus/l3-cache@3105/ibm,phandle 3105
  /proc/device-tree/cpus/l3-cache@3103/ibm,phandle 3103
  /proc/device-tree/cpus/l3-cache@310a/ibm,phandle 310a
  /proc/device-tree/cpus/PowerPC,POWER8@0/ibm,phandle 1000
  /proc/device-tree/cpus/l2-cache@2007/ibm,phandle 2007
  /proc/device-tree/cpus/l3-cache@3101/ibm,phandle 3101
  /proc/device-tree/pci@800201b/ibm,phandle 201b


> With
>   qemu-system-ppc64le -m 16G -machine pseries -smp 8 
>
> I got the following output:
> | entries: 64
> [... phandle dump trimmed; it is truncated in the archive here, and
> quoted in full in the original message at the end of this thread ...]
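
To see concretely how much the firmware's phandle style matters to the
current mask scheme, here is a small userspace sketch that prints the
64-entry cache slot for a handful of the values listed above (a few IBM
pseries values plus one Apple G5 and one SLOF pointer-style value):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* a few ibm,phandle values from the pseries listing above,
	 * plus two pointer-style values (Apple G5, SLOF) */
	static const uint32_t v[] = {
		0x1040, 0x2005, 0x3108, 0xfff8, 0x201b,
		0xff970848, 0x7e732468,
	};
	unsigned int i;

	for (i = 0; i < sizeof(v) / sizeof(v[0]); i++)
		printf("phandle %8x -> 64-entry mask slot %2x\n",
		       (unsigned)v[i], (unsigned)(v[i] & 0x3f));
	return 0;
}

The small pseries values spread across the low six bits, while the
pointer-style values have their low three bits always clear and can only
reach slots that are multiples of 8.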

Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-02 Thread Sebastian Andrzej Siewior
On 2019-11-29 20:14:47 [-0600], Frank Rowand wrote:
> The hash used is based on the assumptions you noted, and as stated in the
> code, that phandle property values are in a contiguous range of 1..n
> (not starting from zero), which is what dtc generates.
> 
> We knew that the hash will not be optimal for systems that do not match
> those assumptions.  Unless there is a serious performance problem for
> such systems, I do not want to make the phandle hash code more complicated
> to optimize for these cases.  And pseries systems have been performing OK:
> I do not remember hearing of any phandle-related performance issues since
> before the cache was added, and the cache can only have helped.
> Yes, if your observations are correct, some memory is being wasted, but
> a 64-entry cache is not very large on a pseries.

okay, so it is nothing new and everyone is aware of the situation. I
move on then :)

> -Frank

Sebastian


Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-11-29 Thread Frank Rowand
On 11/29/19 9:10 AM, Sebastian Andrzej Siewior wrote:
> I've been looking at phandle_cache and noticed the following: The raw
> phandle value as generated by dtc starts at zero and is incremented by
> one for each phandle entry. The qemu pSeries model is using SLOF (which
> is probably the same thing as used on real hardware) and there the
> phandle looks like a pointer value.
> With
>   qemu-system-ppc64le -m 16G -machine pseries -smp 8 
> 
> I got the following output:
> | entries: 64
> [... 42 phandle lines trimmed; the full dump is quoted in the
> original message at the end of this thread ...]
> | Used entries: 8, hashed: 29
> 
> So the hash array has 64 entries, of which only 8 are populated. Using
> hash_32() populates 29 entries.
> Could someone with real hardware verify this?
> I'm not sure how important this is performance-wise, but it looks like a
> waste to use only 1/8 of the array.

The hash used is based on the assumptions you noted, and as stated in the
code, that phandle property values are in a contiguous range of 1..n
(not starting from zero), which is what dtc generates.

We knew that the hash will not be optimal for systems that do not match
those assumptions.  Unless there is a serious performance problem for
such systems, I do not want to make the phandle hash code more complicated
to optimize for these cases.  And pseries systems have been performing OK:
I do not remember hearing of any phandle-related performance issues since
before the cache was added, and the cache can only have helped.
Yes, if your observations are correct, some memory is being wasted, but
a 64-entry cache is not very large on a pseries.

There is already some push back from Rob that the existing code is more
complex than needed (e.g. the variable cache size).

-Frank

> 
> The patch used for testing:
> 
> [... test patch trimmed; it is quoted in full in the original message
> at the end of this thread ...]

[RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-11-29 Thread Sebastian Andrzej Siewior
I've been looking at phandle_cache and noticed the following: The raw
phandle value as generated by dtc starts at zero and is incremented by
one for each phandle entry. The qemu pSeries model is using SLOF (which
is probably the same thing as used on real hardware) and there the
phandle looks like a pointer value.
With
qemu-system-ppc64le -m 16G -machine pseries -smp 8 

I got the following output:
| entries: 64
| phandle 7e732468 slot 28 hash c
| phandle 7e732ad0 slot 10 hash 27
| phandle 7e732ee8 slot 28 hash 3a
| phandle 7e734160 slot 20 hash 36
| phandle 7e734318 slot 18 hash 3a
| phandle 7e734428 slot 28 hash 33
| phandle 7e734538 slot 38 hash 2c
| phandle 7e734850 slot 10 hash e
| phandle 7e735220 slot 20 hash 2d
| phandle 7e735bf0 slot 30 hash d
| phandle 7e7365c0 slot 0 hash 2d
| phandle 7e736f90 slot 10 hash d
| phandle 7e737960 slot 20 hash 2d
| phandle 7e738330 slot 30 hash d
| phandle 7e738d00 slot 0 hash 2d
| phandle 7e739730 slot 30 hash 38
| phandle 7e73bd08 slot 8 hash 17
| phandle 7e73c2e0 slot 20 hash 32
| phandle 7e73c7f8 slot 38 hash 37
| phandle 7e782420 slot 20 hash 13
| phandle 7e782ed8 slot 18 hash 1b
| phandle 7e73ce28 slot 28 hash 39
| phandle 7e73d390 slot 10 hash 22
| phandle 7e73d9a8 slot 28 hash 1a
| phandle 7e73dc28 slot 28 hash 37
| phandle 7e73de00 slot 0 hash a
| phandle 7e73e028 slot 28 hash 0
| phandle 7e7621a8 slot 28 hash 36
| phandle 7e73e458 slot 18 hash 1e
| phandle 7e73e608 slot 8 hash 1e
| phandle 7e740078 slot 38 hash 28
| phandle 7e740180 slot 0 hash 1d
| phandle 7e740240 slot 0 hash 33
| phandle 7e740348 slot 8 hash 29
| phandle 7e740410 slot 10 hash 2
| phandle 7e740eb0 slot 30 hash 3e
| phandle 7e745390 slot 10 hash 33
| phandle 7e747b08 slot 8 hash c
| phandle 7e748528 slot 28 hash f
| phandle 7e74a6e0 slot 20 hash 18
| phandle 7e74aab0 slot 30 hash b
| phandle 7e74f788 slot 8 hash d
| Used entries: 8, hashed: 29

So the hash array has 64 entries, of which only 8 are populated. Using
hash_32() populates 29 entries.
Could someone with real hardware verify this?
I'm not sure how important this is performance-wise, but it looks like a
waste to use only 1/8 of the array.

The patch used for testing:

diff --git a/drivers/of/base.c b/drivers/of/base.c
index 1d667eb730e19..2640d4bc81a9a 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -197,6 +197,7 @@ void of_populate_phandle_cache(void)
u32 cache_entries;
struct device_node *np;
u32 phandles = 0;
+   struct device_node **cache2;
 
	raw_spin_lock_irqsave(&devtree_lock, flags);
 
@@ -214,14 +215,32 @@ void of_populate_phandle_cache(void)
 
phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
GFP_ATOMIC);
+   cache2 = kcalloc(cache_entries, sizeof(*phandle_cache), GFP_ATOMIC);
if (!phandle_cache)
goto out;
 
+   pr_err("%s(%d) entries: %d\n", __func__, __LINE__, cache_entries);
for_each_of_allnodes(np)
if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL) {
+   int slot;
of_node_get(np);
phandle_cache[np->phandle & phandle_cache_mask] = np;
+   slot = hash_32(np->phandle, __ffs(cache_entries));
+   cache2[slot] = np;
+   pr_err("%s(%d) phandle %x slot %x hash %x\n", __func__, 
__LINE__,
+  np->phandle, np->phandle & phandle_cache_mask, 
slot);
}
+   {
+   int i, filled = 0, filled_hash = 0;
+
+   for (i = 0; i < cache_entries; i++) {
+   if (phandle_cache[i])
+   filled++;
+   if (cache2[i])
+   filled_hash++;
+   }
+   pr_err("%s(%d) Used entries: %d, hashed: %d\n", __func__, 
__LINE__, filled, filled_hash);
+   }
 
 out:
	raw_spin_unlock_irqrestore(&devtree_lock, flags);

Sebastian