Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
On Tue, Dec 10, 2019 at 2:17 AM Frank Rowand wrote: > > On 12/9/19 7:51 PM, Rob Herring wrote: > > On Mon, Dec 9, 2019 at 7:35 AM Sebastian Andrzej Siewior > > wrote: > >> > >> On 2019-12-05 20:01:41 [-0600], Frank Rowand wrote: > >>> Is there a memory usage issue for the systems that led to this thread? > >> > >> No, no memory issue led to this thread. I was just testing my patch and > >> I assumed that I did something wrong in the counting/lock drop/lock > >> acquire/allocate path because the array was hardly used. So I started to > >> look deeper… > >> Once I figured out everything was fine, I was curious if everyone is > >> aware of the different phandle creation by dtc vs POWER. And I posted > >> the mail in the thread. > >> Once you confirmed that everything is "known / not an issue" I was ready > >> to take off [0]. > >> > >> Later more replies came in such as one mail [1] from Rob describing the > >> original reason with 814 phandles. _Here_ I was just surprised that 1024 > >> were used over 64 entries for a benefit of 60ms. I understand that this > >> is low concern for you because that memory is released if modules are > >> not enabled. I usually see that module support is left enabled. > >> > >> However, Rob suggested / asked about the fixed size array (this is how I > >> understood it): > >> |And yes, as mentioned earlier I don't like the complexity. I didn't > >> |from the start and I'm I'm still of the opinion we should have a > >> |fixed or 1 time sized true cache (i.e. smaller than total # of > >> |phandles). That would solve the RT memory allocation and locking issue > >> |too. > >> > >> so I attempted to ask if we should have the fixed size array maybe > >> with the hash_32() instead the mask. This would make my other patch > >> obsolete because the fixed size array should not have a RT issue. The > >> hash_32() part here would address the POWER issue where the cache is > >> currently not used efficiently. 
> >> > >> If you want instead to keep things as-is then this is okay from my side. > >> If you want to keep this cache off on POWER then I could contribute a > >> patch doing so. > > > > It turns out there's actually a bug in the current implementation. If > > we have multiple phandles with the same mask, then we leak node > > references if we miss in the cache and re-assign the cache entry. > > Aaargh. Patch sent. > > > Easily fixed I suppose, but holding a ref count for a cached entry > > seems wrong. That means we never have a ref count of 0 on every node > > with a phandle. > > It will go to zero when the cache is freed. > > My memory is that we free the cache as part of removing an overlay. I'll > verify whether my memory is correct. Yes, as part of having entries for every phandle we release and realloc when number of phandles changes. If the size is fixed, then we can stop doing that. We only need to remove entries in of_detach_node() as that should always happen before nodes are removed, right? Rob
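The fixed-size cache Rob describes — a power-of-two table indexed by a hash, with single entries invalidated in of_detach_node() instead of freeing and reallocating the whole cache — might look roughly like the sketch below. This is a hedged userspace illustration, not the kernel implementation: the 64-entry size and the helper names (cache_lookup, cache_insert, cache_remove) are assumptions for illustration; only the golden-ratio multiply in hash_32() mirrors the kernel's include/linux/hash.h.

```c
#include <stddef.h>
#include <stdint.h>

#define PHANDLE_CACHE_BITS 6
#define PHANDLE_CACHE_SZ   (1u << PHANDLE_CACHE_BITS)   /* 64 entries */

struct device_node {
	uint32_t phandle;
};

static struct device_node *phandle_cache[PHANDLE_CACHE_SZ];

/* Same multiplicative hash as the kernel's hash_32(): multiply by
 * GOLDEN_RATIO_32 (0x61C88647) and keep the top `bits` bits. */
static uint32_t hash_32(uint32_t val, unsigned int bits)
{
	return (val * 0x61C88647u) >> (32 - bits);
}

static uint32_t phandle_slot(uint32_t handle)
{
	return hash_32(handle, PHANDLE_CACHE_BITS);
}

/* Hit: the slot holds a node with the wanted phandle.  On a miss the
 * caller would walk the full tree and then call cache_insert(). */
static struct device_node *cache_lookup(uint32_t handle)
{
	struct device_node *np = phandle_cache[phandle_slot(handle)];

	return (np && np->phandle == handle) ? np : NULL;
}

static void cache_insert(struct device_node *np)
{
	if (np && np->phandle)
		phandle_cache[phandle_slot(np->phandle)] = np;
}

/* of_detach_node() would call this: drop the one slot instead of
 * releasing and reallocating the cache when phandles change. */
static void cache_remove(const struct device_node *np)
{
	uint32_t slot = phandle_slot(np->phandle);

	if (phandle_cache[slot] == np)
		phandle_cache[slot] = NULL;
}
```

Note the sketch stores bare pointers without taking a reference, matching Rob's point that holding a ref count for a cached entry seems wrong; correctness then depends on cache_remove() running before the node can be freed.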
Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
On 12/10/19 2:17 AM, Frank Rowand wrote: > On 12/9/19 7:51 PM, Rob Herring wrote: >> On Mon, Dec 9, 2019 at 7:35 AM Sebastian Andrzej Siewior >> wrote: >>> >>> On 2019-12-05 20:01:41 [-0600], Frank Rowand wrote: Is there a memory usage issue for the systems that led to this thread? >>> >>> No, no memory issue led to this thread. I was just testing my patch and >>> I assumed that I did something wrong in the counting/lock drop/lock >>> acquire/allocate path because the array was hardly used. So I started to >>> look deeper… >>> Once I figured out everything was fine, I was curious if everyone is >>> aware of the different phandle creation by dtc vs POWER. And I posted >>> the mail in the thread. >>> Once you confirmed that everything is "known / not an issue" I was ready >>> to take off [0]. >>> >>> Later more replies came in such as one mail [1] from Rob describing the >>> original reason with 814 phandles. _Here_ I was just surprised that 1024 >>> were used over 64 entries for a benefit of 60ms. I understand that this >>> is low concern for you because that memory is released if modules are >>> not enabled. I usually see that module support is left enabled. >>> >>> However, Rob suggested / asked about the fixed size array (this is how I >>> understood it): >>> |And yes, as mentioned earlier I don't like the complexity. I didn't >>> |from the start and I'm I'm still of the opinion we should have a >>> |fixed or 1 time sized true cache (i.e. smaller than total # of >>> |phandles). That would solve the RT memory allocation and locking issue >>> |too. >>> >>> so I attempted to ask if we should have the fixed size array maybe >>> with the hash_32() instead the mask. This would make my other patch >>> obsolete because the fixed size array should not have a RT issue. The >>> hash_32() part here would address the POWER issue where the cache is >>> currently not used efficiently. >>> >>> If you want instead to keep things as-is then this is okay from my side. 
>>> If you want to keep this cache off on POWER then I could contribute a >>> patch doing so. >> >> It turns out there's actually a bug in the current implementation. If >> we have multiple phandles with the same mask, then we leak node >> references if we miss in the cache and re-assign the cache entry. > > Aaargh. Patch sent. > >> Easily fixed I suppose, but holding a ref count for a cached entry >> seems wrong. That means we never have a ref count of 0 on every node >> with a phandle. > > It will go to zero when the cache is freed. > > My memory is that we free the cache as part of removing an overlay. I'll > verify whether my memory is correct. And I'll look at non-overlay use of dynamic devicetree too. -Frank > > -Frank > > >> >> I've done some more experiments with the performance. I've come to the >> conclusion that just measuring boot time is not a good way at least if >> the time is not a significant percentage of the total. I couldn't get >> any measurable data. I'm using a RK3399 Rock960 board. It has 190 >> phandles. I get about 1500 calls to of_find_node_by_phandle() during >> boot. Note that about the first 300 are before we have any timekeeping >> (the prior measurements also would not account for this). Those calls >> have no cache in the current implementation and are cached in my >> implementation. >> >> no cache: 20124 us >> current cache: 819 us >> >> new cache (64 entry): 4342 us >> new cache (128 entry): 2875 us >> new cache (256 entry): 1235 us >> >> To get some idea on the time spent before timekeeping is up, if we >> take the avg miss time is ~17us (20124/1200), then we're spending >> about ~5ms on lookups before the cache is enabled. I'd estimate the >> new cache takes ~400us before timekeeping is up as there's 11 misses >> early. >> >> >From these numbers, it seems the miss rate has a significant impact on >> performance for the different sizes. But taken from the original 20+ >> ms, they all look like good improvement. >> >> Rob >> > >
Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
On 12/9/19 7:51 PM, Rob Herring wrote: > On Mon, Dec 9, 2019 at 7:35 AM Sebastian Andrzej Siewior > wrote: >> >> On 2019-12-05 20:01:41 [-0600], Frank Rowand wrote: >>> Is there a memory usage issue for the systems that led to this thread? >> >> No, no memory issue led to this thread. I was just testing my patch and >> I assumed that I did something wrong in the counting/lock drop/lock >> acquire/allocate path because the array was hardly used. So I started to >> look deeper… >> Once I figured out everything was fine, I was curious if everyone is >> aware of the different phandle creation by dtc vs POWER. And I posted >> the mail in the thread. >> Once you confirmed that everything is "known / not an issue" I was ready >> to take off [0]. >> >> Later more replies came in such as one mail [1] from Rob describing the >> original reason with 814 phandles. _Here_ I was just surprised that 1024 >> were used over 64 entries for a benefit of 60ms. I understand that this >> is low concern for you because that memory is released if modules are >> not enabled. I usually see that module support is left enabled. >> >> However, Rob suggested / asked about the fixed size array (this is how I >> understood it): >> |And yes, as mentioned earlier I don't like the complexity. I didn't >> |from the start and I'm I'm still of the opinion we should have a >> |fixed or 1 time sized true cache (i.e. smaller than total # of >> |phandles). That would solve the RT memory allocation and locking issue >> |too. >> >> so I attempted to ask if we should have the fixed size array maybe >> with the hash_32() instead the mask. This would make my other patch >> obsolete because the fixed size array should not have a RT issue. The >> hash_32() part here would address the POWER issue where the cache is >> currently not used efficiently. >> >> If you want instead to keep things as-is then this is okay from my side. 
>> If you want to keep this cache off on POWER then I could contribute a >> patch doing so. > > It turns out there's actually a bug in the current implementation. If > we have multiple phandles with the same mask, then we leak node > references if we miss in the cache and re-assign the cache entry. Aaargh. Patch sent. > Easily fixed I suppose, but holding a ref count for a cached entry > seems wrong. That means we never have a ref count of 0 on every node > with a phandle. It will go to zero when the cache is freed. My memory is that we free the cache as part of removing an overlay. I'll verify whether my memory is correct. -Frank > > I've done some more experiments with the performance. I've come to the > conclusion that just measuring boot time is not a good way at least if > the time is not a significant percentage of the total. I couldn't get > any measurable data. I'm using a RK3399 Rock960 board. It has 190 > phandles. I get about 1500 calls to of_find_node_by_phandle() during > boot. Note that about the first 300 are before we have any timekeeping > (the prior measurements also would not account for this). Those calls > have no cache in the current implementation and are cached in my > implementation. > > no cache: 20124 us > current cache: 819 us > > new cache (64 entry): 4342 us > new cache (128 entry): 2875 us > new cache (256 entry): 1235 us > > To get some idea on the time spent before timekeeping is up, if we > take the avg miss time is ~17us (20124/1200), then we're spending > about ~5ms on lookups before the cache is enabled. I'd estimate the > new cache takes ~400us before timekeeping is up as there's 11 misses > early. > >>From these numbers, it seems the miss rate has a significant impact on > performance for the different sizes. But taken from the original 20+ > ms, they all look like good improvement. > > Rob >
Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
On Mon, Dec 9, 2019 at 7:35 AM Sebastian Andrzej Siewior wrote: > > On 2019-12-05 20:01:41 [-0600], Frank Rowand wrote: > > Is there a memory usage issue for the systems that led to this thread? > > No, no memory issue led to this thread. I was just testing my patch and > I assumed that I did something wrong in the counting/lock drop/lock > acquire/allocate path because the array was hardly used. So I started to > look deeper… > Once I figured out everything was fine, I was curious if everyone is > aware of the different phandle creation by dtc vs POWER. And I posted > the mail in the thread. > Once you confirmed that everything is "known / not an issue" I was ready > to take off [0]. > > Later more replies came in such as one mail [1] from Rob describing the > original reason with 814 phandles. _Here_ I was just surprised that 1024 > were used over 64 entries for a benefit of 60ms. I understand that this > is low concern for you because that memory is released if modules are > not enabled. I usually see that module support is left enabled. > > However, Rob suggested / asked about the fixed size array (this is how I > understood it): > |And yes, as mentioned earlier I don't like the complexity. I didn't > |from the start and I'm I'm still of the opinion we should have a > |fixed or 1 time sized true cache (i.e. smaller than total # of > |phandles). That would solve the RT memory allocation and locking issue > |too. > > so I attempted to ask if we should have the fixed size array maybe > with the hash_32() instead the mask. This would make my other patch > obsolete because the fixed size array should not have a RT issue. The > hash_32() part here would address the POWER issue where the cache is > currently not used efficiently. > > If you want instead to keep things as-is then this is okay from my side. > If you want to keep this cache off on POWER then I could contribute a > patch doing so. It turns out there's actually a bug in the current implementation. 
If we have multiple phandles with the same mask, then we leak node references if we miss in the cache and re-assign the cache entry. Easily fixed I suppose, but holding a ref count for a cached entry seems wrong. That means we never have a ref count of 0 on any node with a phandle.

I've done some more experiments with the performance. I've come to the conclusion that just measuring boot time is not a good way, at least if the time is not a significant percentage of the total; I couldn't get any measurable data. I'm using a RK3399 Rock960 board. It has 190 phandles, and I get about 1500 calls to of_find_node_by_phandle() during boot. Note that about the first 300 are before we have any timekeeping (the prior measurements also would not account for this). Those calls have no cache in the current implementation and are cached in my implementation.

no cache: 20124 us
current cache: 819 us

new cache (64 entry): 4342 us
new cache (128 entry): 2875 us
new cache (256 entry): 1235 us

To get some idea of the time spent before timekeeping is up: if we take the avg miss time as ~17us (20124/1200), then we're spending about ~5ms on lookups before the cache is enabled. I'd estimate the new cache takes ~400us before timekeeping is up, as there are 11 misses early.

From these numbers, it seems the miss rate has a significant impact on performance for the different sizes. But compared to the original 20+ ms, they all look like a good improvement.

Rob
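Rob's back-of-the-envelope estimates can be re-derived directly from the measured figures (1500 total calls, ~300 of them before timekeeping, 20124 us total for the no-cache case). The sketch below just redoes that arithmetic; the helper names are illustrative.

```c
/* Re-derive the estimates from the measured numbers in the mail. */
static const double no_cache_total_us = 20124.0; /* measured, no cache    */
static const double timed_calls       = 1200.0;  /* ~1500 - ~300 untimed  */
static const double untimed_calls     = 300.0;   /* before timekeeping    */

/* Average cost of one uncached of_find_node_by_phandle() call. */
static double avg_miss_us(void)
{
	return no_cache_total_us / timed_calls;      /* ~16.8 us */
}

/* Estimated time spent on lookups before timekeeping is up. */
static double pre_timekeeping_us(void)
{
	return untimed_calls * avg_miss_us();        /* ~5 ms */
}
```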
Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
On 2019-12-05 20:01:41 [-0600], Frank Rowand wrote:
> Is there a memory usage issue for the systems that led to this thread?

No, no memory issue led to this thread. I was just testing my patch and I assumed that I did something wrong in the counting/lock drop/lock acquire/allocate path because the array was hardly used. So I started to look deeper.

Once I figured out everything was fine, I was curious whether everyone is aware of the different phandle creation by dtc vs POWER, and I posted the mail in the thread. Once you confirmed that everything is "known / not an issue" I was ready to take off [0].

Later more replies came in, such as one mail [1] from Rob describing the original reason with 814 phandles. _Here_ I was just surprised that 1024 entries were used instead of 64 for a benefit of 60ms. I understand that this is of low concern for you because that memory is released if module support is not enabled; I usually see module support left enabled.

However, Rob suggested / asked about the fixed-size array (this is how I understood it):

| And yes, as mentioned earlier I don't like the complexity. I didn't
| from the start and I'm still of the opinion we should have a
| fixed or one-time sized true cache (i.e. smaller than total # of
| phandles). That would solve the RT memory allocation and locking issue
| too.

So I attempted to ask whether we should have the fixed-size array, maybe with hash_32() instead of the mask. This would make my other patch obsolete, because the fixed-size array should not have an RT issue. The hash_32() part here would address the POWER issue where the cache is currently not used efficiently.

If you want instead to keep things as-is then this is okay from my side. If you want to keep this cache off on POWER then I could contribute a patch doing so.
[0] https://lore.kernel.org/linux-devicetree/20191202110732.4dvzrro5o6zrl...@linutronix.de/
[1] https://lore.kernel.org/linux-devicetree/CAL_JsqKieG5=teL7gABPKbJOQfvoS9s-ZPF-=r0yee_luoy...@mail.gmail.com/

> -Frank

Sebastian
Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
On 12/5/19 7:52 PM, Frank Rowand wrote: > On 12/3/19 10:56 AM, Rob Herring wrote: >> On Mon, Dec 2, 2019 at 10:28 PM Frank Rowand wrote: >>> >>> On 12/2/19 10:12 PM, Michael Ellerman wrote: Frank Rowand writes: > On 11/29/19 9:10 AM, Sebastian Andrzej Siewior wrote: >> I've been looking at phandle_cache and noticed the following: The raw >> phandle value as generated by dtc starts at zero and is incremented by >> one for each phandle entry. The qemu pSeries model is using Slof (which >> is probably the same thing as used on real hardware) and this looks like >> a poiner value for the phandle. >> With >> qemu-system-ppc64le -m 16G -machine pseries -smp 8 >> >> I got the following output: >> | entries: 64 >> | phandle 7e732468 slot 28 hash c >> | phandle 7e732ad0 slot 10 hash 27 >> | phandle 7e732ee8 slot 28 hash 3a >> | phandle 7e734160 slot 20 hash 36 >> | phandle 7e734318 slot 18 hash 3a >> | phandle 7e734428 slot 28 hash 33 >> | phandle 7e734538 slot 38 hash 2c >> | phandle 7e734850 slot 10 hash e >> | phandle 7e735220 slot 20 hash 2d >> | phandle 7e735bf0 slot 30 hash d >> | phandle 7e7365c0 slot 0 hash 2d >> | phandle 7e736f90 slot 10 hash d >> | phandle 7e737960 slot 20 hash 2d >> | phandle 7e738330 slot 30 hash d >> | phandle 7e738d00 slot 0 hash 2d >> | phandle 7e739730 slot 30 hash 38 >> | phandle 7e73bd08 slot 8 hash 17 >> | phandle 7e73c2e0 slot 20 hash 32 >> | phandle 7e73c7f8 slot 38 hash 37 >> | phandle 7e782420 slot 20 hash 13 >> | phandle 7e782ed8 slot 18 hash 1b >> | phandle 7e73ce28 slot 28 hash 39 >> | phandle 7e73d390 slot 10 hash 22 >> | phandle 7e73d9a8 slot 28 hash 1a >> | phandle 7e73dc28 slot 28 hash 37 >> | phandle 7e73de00 slot 0 hash a >> | phandle 7e73e028 slot 28 hash 0 >> | phandle 7e7621a8 slot 28 hash 36 >> | phandle 7e73e458 slot 18 hash 1e >> | phandle 7e73e608 slot 8 hash 1e >> | phandle 7e740078 slot 38 hash 28 >> | phandle 7e740180 slot 0 hash 1d >> | phandle 7e740240 slot 0 hash 33 >> | phandle 7e740348 slot 8 hash 29 >> | 
phandle 7e740410 slot 10 hash 2 >> | phandle 7e740eb0 slot 30 hash 3e >> | phandle 7e745390 slot 10 hash 33 >> | phandle 7e747b08 slot 8 hash c >> | phandle 7e748528 slot 28 hash f >> | phandle 7e74a6e0 slot 20 hash 18 >> | phandle 7e74aab0 slot 30 hash b >> | phandle 7e74f788 slot 8 hash d >> | Used entries: 8, hashed: 29 >> >> So the hash array has 64 entries out which only 8 are populated. Using >> hash_32() populates 29 entries. >> Could someone with real hardware verify this? >> I'm not sure how important this performance wise, it looks just like a >> waste using only 1/8 of the array. > > The hash used is based on the assumptions you noted, and as stated in the > code, that phandle property values are in a contiguous range of 1..n > (not starting from zero), which is what dtc generates. > We knew that for systems that do not match the assumptions that the hash > will not be optimal. If we're going to have the phandle cache it should at least make some attempt to work across the systems that Linux supports. > Unless there is a serious performance problem for > such systems, I do not want to make the phandle hash code more complicated > to optimize for these cases. And the pseries have been performing ok > without phandle related performance issues that I remember hearing since > before the cache was added, which could have only helped the performance. > Yes, if your observations are correct, some memory is being wasted, but > a 64 entry cache is not very large on a pseries. A single line change to use an actual hash function is hardly complicating it, compared to the amount of code already there. >>> >>> With a dtc generated devicetree, the hash is perfect, with no >>> misses. That single line change then makes the hash bad for >>> dtc generated devicetrees. >> >> To quantify that, I did some tests with the hash algo and it's about a >> 10% collision rate using phandle values of 1-$cache_size. There's >> never more than 2 phandles colliding in a slot. 
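The difference between masking and hashing can be reproduced with a few of the SLOF phandle values from Sebastian's dump and with dtc-style sequential phandles. Below is a small sketch; the slot-counting helper is illustrative, while the hash is the kernel's golden-ratio hash_32().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_BITS 6                     /* 64-entry cache */
#define CACHE_SZ   (1u << CACHE_BITS)

/* Kernel-style hash_32(): multiply by GOLDEN_RATIO_32, keep top bits. */
static uint32_t hash_32(uint32_t val, unsigned int bits)
{
	return (val * 0x61C88647u) >> (32 - bits);
}

static uint32_t mask_slot(uint32_t phandle)
{
	return phandle & (CACHE_SZ - 1);     /* current implementation */
}

static uint32_t hash_slot(uint32_t phandle)
{
	return hash_32(phandle, CACHE_BITS); /* proposed alternative */
}

/* How many distinct cache slots does a set of phandles occupy? */
static unsigned int slots_used(const uint32_t *ph, size_t n,
			       uint32_t (*slot)(uint32_t))
{
	bool used[CACHE_SZ] = { false };
	unsigned int count = 0;

	for (size_t i = 0; i < n; i++) {
		uint32_t s = slot(ph[i]);

		if (!used[s]) {
			used[s] = true;
			count++;
		}
	}
	return count;
}
```

SLOF's phandles are 8-byte-aligned pointer values, so `phandle & 63` can only ever produce the eight slots whose low three bits are zero; hash_32() has no such restriction. Conversely, for dtc-style phandles 1..64 the mask is a perfect hash (all 64 slots used, zero collisions), while hash_32() collides in a few slots, consistent with the roughly 10-14% rates Rob and Frank measured.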
The actual impact >> would be less given 100% thrashing seems unlikely. I also tested a few cache sizes. For cache size 64 entries, I get a 14% collision rate (and also 14% of entries unused). For 1024 entries, I get a 12% collision rate. And I can confirm no more than two phandles hashing to the same slot. So essentially confirming your data. -Frank > > Thank you for doing the tests. Actual data helps a lot. > > If there is only a 10% collision rate for this case, that does not > sound bad to me. There is the possibility of current or future > code resulting in ping ponging between two phandle values which > collide in the cache, but that should
Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
On 12/6/19 5:40 PM, Segher Boessenkool wrote: > Hi, > > On Thu, Dec 05, 2019 at 07:37:24PM -0600, Frank Rowand wrote: >> On 12/3/19 12:35 PM, Segher Boessenkool wrote: >>> Btw. Some OFs mangle the phandles some way, to make it easier to catch >>> people using it as an address (and similarly, mangle ihandles differently, >>> so you catch confusion between ihandles and phandles as well). Like a >>> simple xor, with some odd number preferably. You should assume *nothing* >>> about phandles, they are opaque identifiers. >> >> For arm32 machines that use dtc to generate the devicetree, which is a >> very large user base, we certainly can make assumptions about phandles. > > I was talking about OF. Phandles are explicitly defined to be opaque > tokens. If there is an extra meaning to them in flattened device trees, > well, the kernel should then only depend on that there, not for more > general phandles. Where is this documented btw? And dtc generated devicetrees are a huge proportion of the OF systems. It is not documented. As an aside, overlays also depend upon the current dtc implementation. If an extremely large value is used for a phandle then overlay application will fail. > >> Especially because the complaints about the overhead of phandle based >> lookups have been voiced by users of this specific set of machines. >> >> For systems with a devicetree that does not follow the assumptions, the >> phandle cache should not measurably increase the overhead of phandle >> based lookups. > > It's an extra memory access and extra code to execute, for not much gain > (if anything). While with a reasonable hash function it will be good > for everyone. > >> If you have measurements of a system where implementing the phandle >> cache increased the overhead, > > Are you seriously saying you think this code can run in zero time and > space on most systems? No. I made no such claim. Note the additional words in the following sentences. 
>> and the additional overhead is a concern (such as significantly
>> increasing boot time) then please share that information with us.
>> Otherwise this is just a theoretical exercise.
>
> The point is that this code could be easily beneficial for most (or all)
> users, not just those that use dtc-constructed device trees. It is

The point is that the cache was implemented to solve a specific problem for certain specific systems. There had been a few reports of various machines having the same issue, but finally someone measured a **significant** improvement in boot time for a specific machine. The boot time with the cache was **measured** to be much shorter. The boot time for all systems with a dtc generated devicetree is expected to be faster. No one responded to the implementation when it was proposed with a **measurement** that showed increased boot time. A concern of using more memory was raised and discussed, with at least one feature added as a result (freeing the cache in late init if modules are not enabled).

Being "beneficial for most (or all) users" has to be balanced against whether the change would remove the benefit for the system that the feature was originally implemented to solve a problem for. There was no performance data supplied to answer this question. (Though eventually Rob did some measurements of the impact on hash efficiency for such a system.)

> completely obvious that having a worse cache hash function results in
> many more lookups. Whether that results in something expressed as
> milliseconds on tiny systems or microseconds on bigger systems is
> completely beside the point.

There was no performance data accompanying the proposed change that started this thread. There was no data showing whether the systems that this feature was created for would suffer. There was no data showing that the boot time of the pseries systems would improve.
There was no assertion made that too much memory was being used by the cache (there was an implied assertion that a large percentage of the memory used for the cache was not used, thus the performance benefit of the cache could be improved by changing to using a hash instead of mask). We had rejected creating a cache for several years until finally some solid data was provided showing an actual need for it. It is not a question of "milliseconds on tiny systems or microseconds on bigger systems". I agree with that. But it does matter whether the performance impact of various implementations is large enough to either solve a problem or create a problem. On the other hand, if the amount of memory used by the cache is a problem (which is _not_ what was asserted by the patch submitter) then we can have a conversation about how to resolve that. -Frank > > > Segher >
Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
Hi, On Thu, Dec 05, 2019 at 07:37:24PM -0600, Frank Rowand wrote: > On 12/3/19 12:35 PM, Segher Boessenkool wrote: > > Btw. Some OFs mangle the phandles some way, to make it easier to catch > > people using it as an address (and similarly, mangle ihandles differently, > > so you catch confusion between ihandles and phandles as well). Like a > > simple xor, with some odd number preferably. You should assume *nothing* > > about phandles, they are opaque identifiers. > > For arm32 machines that use dtc to generate the devicetree, which is a > very large user base, we certainly can make assumptions about phandles. I was talking about OF. Phandles are explicitly defined to be opaque tokens. If there is an extra meaning to them in flattened device trees, well, the kernel should then only depend on that there, not for more general phandles. Where is this documented btw? > Especially because the complaints about the overhead of phandle based > lookups have been voiced by users of this specific set of machines. > > For systems with a devicetree that does not follow the assumptions, the > phandle cache should not measurably increase the overhead of phandle > based lookups. It's an extra memory access and extra code to execute, for not much gain (if anything). While with a reasonable hash function it will be good for everyone. > If you have measurements of a system where implementing the phandle > cache increased the overhead, Are you seriously saying you think this code can run in zero time and space on most systems? > and the additional overhead is a concern > (such as significantly increasing boot time) then please share that > information with us. Otherwise this is just a theoretical exercise. The point is that this code could be easily beneficial for most (or all) users, not just those that use dtc-constructed device trees. It is completely obvious that having a worse cache hash function results in many more lookups. 
Whether that results in something expressed as milliseconds on tiny systems or microseconds on bigger systems is completely beside the point. Segher
Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
On 12/5/19 10:35 AM, Sebastian Andrzej Siewior wrote:
> On 2019-12-03 10:56:35 [-0600], Rob Herring wrote:
>>> Another possibility would be to make the cache be dependent
>>> upon not CONFIG_PPC. It might be possible to disable the
>>> cache with a minimal code change.
>>
>> I'd rather not do that.
>>
>> And yes, as mentioned earlier I don't like the complexity. I didn't
>> from the start and I'm still of the opinion we should have a
>> fixed or one-time sized true cache (i.e. smaller than total # of
>> phandles). That would solve the RT memory allocation and locking issue
>> too.
>>
>> For reference, the performance difference between the current
>> implementation (assuming fixes haven't regressed it) was ~400ms vs. a
>> ~340ms improvement with a 64 entry cache (using a mask, not a hash).
>> IMO, 340ms improvement was good enough.
>
> Okay. So the 814 phandles would result in an array with 1024 slots. That
> would need 4KiB of memory.

Is this amount of memory an issue for this system? If module support is not configured into the kernel then the cache is removed and the memory freed in a late initcall. I don't know if that helps your use case or not.

> What about we go back to the fixed 64-slot array but with hash_32() for
> the lookup? Without the hash we would be 60ms slower during boot
> (compared to now, based on ancient data) but then the hash isn't
> expensive, so we end up with better coverage of the memory on systems
> which don't have a plain enumeration of the phandle.

That performance data is specific to one particular system. It does not generalize to all devicetree based systems, so don't draw too many conclusions from it. If you want to understand the boot performance impact for your system, you need to measure the alternatives on your system.

Is there a memory usage issue for the systems that led to this thread?
Unless there is a documented memory issue, I do not want to expand an issue of poor cache-bucket utilization into the separate issue of cache size.

-Frank
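The 1024-slot figure above comes from rounding the phandle count up to a power of two, as the dynamically sized cache does (roundup_pow_of_two() in the kernel). A sketch of that arithmetic follows; the helper names are illustrative, and note the byte figure depends on the entry size (4 KiB at 4 bytes per slot as in the mail; pointer-sized slots on a 64-bit kernel would double that).

```c
#include <stdint.h>

/* Smallest power of two >= n, like the kernel's roundup_pow_of_two(). */
static uint32_t roundup_pow_of_two_u32(uint32_t n)
{
	uint32_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Cache footprint in bytes for a given number of live phandles. */
static uint32_t cache_bytes(uint32_t nr_phandles, uint32_t entry_size)
{
	return roundup_pow_of_two_u32(nr_phandles) * entry_size;
}
```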
Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
On 12/3/19 10:56 AM, Rob Herring wrote: > On Mon, Dec 2, 2019 at 10:28 PM Frank Rowand wrote: >> >> On 12/2/19 10:12 PM, Michael Ellerman wrote: >>> Frank Rowand writes: On 11/29/19 9:10 AM, Sebastian Andrzej Siewior wrote: > I've been looking at phandle_cache and noticed the following: The raw > phandle value as generated by dtc starts at zero and is incremented by > one for each phandle entry. The qemu pSeries model is using Slof (which > is probably the same thing as used on real hardware) and this looks like > a poiner value for the phandle. > With > qemu-system-ppc64le -m 16G -machine pseries -smp 8 > > I got the following output: > | entries: 64 > | phandle 7e732468 slot 28 hash c > | phandle 7e732ad0 slot 10 hash 27 > | phandle 7e732ee8 slot 28 hash 3a > | phandle 7e734160 slot 20 hash 36 > | phandle 7e734318 slot 18 hash 3a > | phandle 7e734428 slot 28 hash 33 > | phandle 7e734538 slot 38 hash 2c > | phandle 7e734850 slot 10 hash e > | phandle 7e735220 slot 20 hash 2d > | phandle 7e735bf0 slot 30 hash d > | phandle 7e7365c0 slot 0 hash 2d > | phandle 7e736f90 slot 10 hash d > | phandle 7e737960 slot 20 hash 2d > | phandle 7e738330 slot 30 hash d > | phandle 7e738d00 slot 0 hash 2d > | phandle 7e739730 slot 30 hash 38 > | phandle 7e73bd08 slot 8 hash 17 > | phandle 7e73c2e0 slot 20 hash 32 > | phandle 7e73c7f8 slot 38 hash 37 > | phandle 7e782420 slot 20 hash 13 > | phandle 7e782ed8 slot 18 hash 1b > | phandle 7e73ce28 slot 28 hash 39 > | phandle 7e73d390 slot 10 hash 22 > | phandle 7e73d9a8 slot 28 hash 1a > | phandle 7e73dc28 slot 28 hash 37 > | phandle 7e73de00 slot 0 hash a > | phandle 7e73e028 slot 28 hash 0 > | phandle 7e7621a8 slot 28 hash 36 > | phandle 7e73e458 slot 18 hash 1e > | phandle 7e73e608 slot 8 hash 1e > | phandle 7e740078 slot 38 hash 28 > | phandle 7e740180 slot 0 hash 1d > | phandle 7e740240 slot 0 hash 33 > | phandle 7e740348 slot 8 hash 29 > | phandle 7e740410 slot 10 hash 2 > | phandle 7e740eb0 slot 30 hash 3e > | phandle 7e745390 
slot 10 hash 33 > | phandle 7e747b08 slot 8 hash c > | phandle 7e748528 slot 28 hash f > | phandle 7e74a6e0 slot 20 hash 18 > | phandle 7e74aab0 slot 30 hash b > | phandle 7e74f788 slot 8 hash d > | Used entries: 8, hashed: 29 > > So the hash array has 64 entries out which only 8 are populated. Using > hash_32() populates 29 entries. > Could someone with real hardware verify this? > I'm not sure how important this performance wise, it looks just like a > waste using only 1/8 of the array. The hash used is based on the assumptions you noted, and as stated in the code, that phandle property values are in a contiguous range of 1..n (not starting from zero), which is what dtc generates. >>> We knew that for systems that do not match the assumptions that the hash will not be optimal. >>> >>> If we're going to have the phandle cache it should at least make some >>> attempt to work across the systems that Linux supports. >>> Unless there is a serious performance problem for such systems, I do not want to make the phandle hash code more complicated to optimize for these cases. And the pseries have been performing ok without phandle related performance issues that I remember hearing since before the cache was added, which could have only helped the performance. Yes, if your observations are correct, some memory is being wasted, but a 64 entry cache is not very large on a pseries. >>> >>> A single line change to use an actual hash function is hardly >>> complicating it, compared to the amount of code already there. >> >> With a dtc generated devicetree, the hash is perfect, with no >> misses. That single line change then makes the hash bad for >> dtc generated devicetrees. > > To quantify that, I did some tests with the hash algo and it's about a > 10% collision rate using phandle values of 1-$cache_size. There's > never more than 2 phandles colliding in a slot. The actual impact > would be less given 100% thrashing seems unlikely. Thank you for doing the tests. 
Actual data helps a lot. If there is only a 10% collision rate for this case, that does not sound bad to me. There is the possibility of current or future code resulting in ping-ponging between two phandle values which collide in the cache, but that should not be the common case. However, given a choice between two algorithms, one of which guarantees no thrashing (the current cache algorithm) and one which allows a pathological use case which results in thrashing, I prefer the first algorithm. This may seem theoretical, but I have seen enough pathological code paths in my years of performance analysis and tuning to be sensitive to this issue. > >> The cache was added to address a
Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
On 12/3/19 12:35 PM, Segher Boessenkool wrote: > Hi! > > On Tue, Dec 03, 2019 at 03:03:22PM +1100, Michael Ellerman wrote: >> Sebastian Andrzej Siewior writes: >> I've certainly heard it said that on some OF's the phandle was just == >> the address of the internal representation, and I guess maybe for SLOF >> that is true. > > It is (or was). In many OFs it is just the effective address of some > node structure. SLOF runs with translation off normally. > >> They seem to vary wildly though, eg. on an Apple G5: > > Apple OF runs with translation on usually. IIRC these are effective > addresses as well. > > The OF they have on G5 machines is mostly 32-bit, for compatibility is my > guess (for userland things dealing with addresses from OF, importantly). > >> $ find /proc/device-tree/ -name phandle | xargs lsprop | head -10 >> /proc/device-tree/vsp@0,f900/veo@f918/phandle ff970848 >> /proc/device-tree/vsp@0,f900/phandle ff970360 >> /proc/device-tree/vsp@0,f900/veo@f908/phandle ff970730 >> /proc/device-tree/nvram@0,fff04000/phandle ff967fb8 >> /proc/device-tree/xmodem/phandle ff9655e8 >> /proc/device-tree/multiboot/phandle ff9504f0 >> /proc/device-tree/diagnostics/phandle ff965550 >> /proc/device-tree/options/phandle ff893cf0 >> /proc/device-tree/openprom/client-services/phandle ff8925b8 >> /proc/device-tree/openprom/phandle ff892458 >> >> That machine does not have enough RAM for those to be 32-bit real >> addresses. I think Apple OF is running in virtual mode though (?), so >> maybe they are pointers? > > Yes, I think the default is to have 8MB ram at the top of 4GB (which is > the physical address of the bootrom, btw) for OF. 
> >> And on an IBM pseries machine they're a bit all over the place: >> >> /proc/device-tree/cpus/PowerPC,POWER8@40/ibm,phandle 1040 >> /proc/device-tree/cpus/l2-cache@2005/ibm,phandle 2005 >> /proc/device-tree/cpus/PowerPC,POWER8@30/ibm,phandle 1030 >> /proc/device-tree/cpus/PowerPC,POWER8@20/ibm,phandle 1020 >> /proc/device-tree/cpus/PowerPC,POWER8@10/ibm,phandle 1010 >> /proc/device-tree/cpus/l2-cache@2003/ibm,phandle 2003 >> /proc/device-tree/cpus/l2-cache@200a/ibm,phandle 200a >> /proc/device-tree/cpus/l3-cache@3108/ibm,phandle 3108 >> /proc/device-tree/cpus/l2-cache@2001/ibm,phandle 2001 >> /proc/device-tree/cpus/l3-cache@3106/ibm,phandle 3106 >> /proc/device-tree/cpus/ibm,phandle fff8 >> /proc/device-tree/cpus/l3-cache@3104/ibm,phandle 3104 >> /proc/device-tree/cpus/l2-cache@2008/ibm,phandle 2008 >> /proc/device-tree/cpus/l3-cache@3102/ibm,phandle 3102 >> /proc/device-tree/cpus/l2-cache@2006/ibm,phandle 2006 >> /proc/device-tree/cpus/l3-cache@3100/ibm,phandle 3100 >> /proc/device-tree/cpus/PowerPC,POWER8@8/ibm,phandle 1008 >> /proc/device-tree/cpus/l2-cache@2004/ibm,phandle 2004 >> /proc/device-tree/cpus/PowerPC,POWER8@48/ibm,phandle 1048 >> /proc/device-tree/cpus/PowerPC,POWER8@38/ibm,phandle 1038 >> /proc/device-tree/cpus/l2-cache@2002/ibm,phandle 2002 >> /proc/device-tree/cpus/PowerPC,POWER8@28/ibm,phandle 1028 >> /proc/device-tree/cpus/l3-cache@3107/ibm,phandle 3107 >> /proc/device-tree/cpus/PowerPC,POWER8@18/ibm,phandle 1018 >> /proc/device-tree/cpus/l2-cache@2000/ibm,phandle 2000 >> /proc/device-tree/cpus/l3-cache@3105/ibm,phandle 3105 >> /proc/device-tree/cpus/l3-cache@3103/ibm,phandle 3103 >> /proc/device-tree/cpus/l3-cache@310a/ibm,phandle 310a >> /proc/device-tree/cpus/PowerPC,POWER8@0/ibm,phandle 1000 >> /proc/device-tree/cpus/l2-cache@2007/ibm,phandle 2007 >> /proc/device-tree/cpus/l3-cache@3101/ibm,phandle 3101 >> /proc/device-tree/pci@800201b/ibm,phandle 201b > > Some (the 1000) look like addresses as well. 
> >>> So the hash array has 64 entries out of which only 8 are populated. Using >>> hash_32() populates 29 entries. > >> On the G5 it's similarly inefficient: >> [0.007379] OF: of_populate_phandle_cache(242) Used entries: 31, hashed: >> 111 > >> And some output from a "real" pseries machine (IBM OF), which is >> slightly better: >> [0.129467] OF: of_populate_phandle_cache(242) Used entries: 39, hashed: >> 81 > >> So yeah using hash_32() is quite a bit better in both cases. > > Yup, no surprise there. And hash_32 is very cheap to compute. > >> And if I'm reading your patch right it would be a single line change to >> switch, so that seems like it's worth doing to me. > > Agreed! > > Btw. Some OFs mangle the phandles some way, to make it easier to catch > people using it as an address (and similarly, mangle ihandles differently, > so you catch confusion between ihandles and phandles as well). Like a > simple xor, with some odd number preferably. You should assume *nothing* > about phandles, they are opaque identifiers. For arm32 machines that use dtc to generate the devicetree, which is
Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
On 2019-12-03 10:56:35 [-0600], Rob Herring wrote: > > Another possibility would be to make the cache be dependent > > upon not CONFIG_PPC. It might be possible to disable the > > cache with a minimal code change. > > I'd rather not do that. > > And yes, as mentioned earlier I don't like the complexity. I didn't > from the start and I'm still of the opinion we should have a > fixed or 1 time sized true cache (i.e. smaller than total # of > phandles). That would solve the RT memory allocation and locking issue > too. > > For reference, the performance difference between the current > implementation (assuming fixes haven't regressed it) was ~400ms vs. a > ~340ms improvement with a 64 entry cache (using a mask, not a hash). > IMO, 340ms improvement was good enough. Okay. So the 814 phandles would result in an array with 1024 slots. That would need 4KiB of memory. What about going back to the fixed 64-slot array, but with hash_32() for the lookup? Without the hash we would be 60ms slower during boot (compared to now, based on ancient data), but the hash isn't expensive, so we end up with better coverage of the cache on systems which don't have a plain enumeration of phandles. > Rob Sebastian
Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
Hi! On Tue, Dec 03, 2019 at 03:03:22PM +1100, Michael Ellerman wrote: > Sebastian Andrzej Siewior writes: > I've certainly heard it said that on some OF's the phandle was just == > the address of the internal representation, and I guess maybe for SLOF > that is true. It is (or was). In many OFs it is just the effective address of some node structure. SLOF runs with translation off normally. > They seem to vary wildly though, eg. on an Apple G5: Apple OF runs with translation on usually. IIRC these are effective addresses as well. The OF they have on G5 machines is mostly 32-bit, for compatibility is my guess (for userland things dealing with addresses from OF, importantly). > $ find /proc/device-tree/ -name phandle | xargs lsprop | head -10 > /proc/device-tree/vsp@0,f900/veo@f918/phandle ff970848 > /proc/device-tree/vsp@0,f900/phandle ff970360 > /proc/device-tree/vsp@0,f900/veo@f908/phandle ff970730 > /proc/device-tree/nvram@0,fff04000/phandle ff967fb8 > /proc/device-tree/xmodem/phandle ff9655e8 > /proc/device-tree/multiboot/phandle ff9504f0 > /proc/device-tree/diagnostics/phandle ff965550 > /proc/device-tree/options/phandle ff893cf0 > /proc/device-tree/openprom/client-services/phandle ff8925b8 > /proc/device-tree/openprom/phandle ff892458 > > That machine does not have enough RAM for those to be 32-bit real > addresses. I think Apple OF is running in virtual mode though (?), so > maybe they are pointers? Yes, I think the default is to have 8MB ram at the top of 4GB (which is the physical address of the bootrom, btw) for OF. 
> And on an IBM pseries machine they're a bit all over the place: > > /proc/device-tree/cpus/PowerPC,POWER8@40/ibm,phandle 1040 > /proc/device-tree/cpus/l2-cache@2005/ibm,phandle 2005 > /proc/device-tree/cpus/PowerPC,POWER8@30/ibm,phandle 1030 > /proc/device-tree/cpus/PowerPC,POWER8@20/ibm,phandle 1020 > /proc/device-tree/cpus/PowerPC,POWER8@10/ibm,phandle 1010 > /proc/device-tree/cpus/l2-cache@2003/ibm,phandle 2003 > /proc/device-tree/cpus/l2-cache@200a/ibm,phandle 200a > /proc/device-tree/cpus/l3-cache@3108/ibm,phandle 3108 > /proc/device-tree/cpus/l2-cache@2001/ibm,phandle 2001 > /proc/device-tree/cpus/l3-cache@3106/ibm,phandle 3106 > /proc/device-tree/cpus/ibm,phandle fff8 > /proc/device-tree/cpus/l3-cache@3104/ibm,phandle 3104 > /proc/device-tree/cpus/l2-cache@2008/ibm,phandle 2008 > /proc/device-tree/cpus/l3-cache@3102/ibm,phandle 3102 > /proc/device-tree/cpus/l2-cache@2006/ibm,phandle 2006 > /proc/device-tree/cpus/l3-cache@3100/ibm,phandle 3100 > /proc/device-tree/cpus/PowerPC,POWER8@8/ibm,phandle 1008 > /proc/device-tree/cpus/l2-cache@2004/ibm,phandle 2004 > /proc/device-tree/cpus/PowerPC,POWER8@48/ibm,phandle 1048 > /proc/device-tree/cpus/PowerPC,POWER8@38/ibm,phandle 1038 > /proc/device-tree/cpus/l2-cache@2002/ibm,phandle 2002 > /proc/device-tree/cpus/PowerPC,POWER8@28/ibm,phandle 1028 > /proc/device-tree/cpus/l3-cache@3107/ibm,phandle 3107 > /proc/device-tree/cpus/PowerPC,POWER8@18/ibm,phandle 1018 > /proc/device-tree/cpus/l2-cache@2000/ibm,phandle 2000 > /proc/device-tree/cpus/l3-cache@3105/ibm,phandle 3105 > /proc/device-tree/cpus/l3-cache@3103/ibm,phandle 3103 > /proc/device-tree/cpus/l3-cache@310a/ibm,phandle 310a > /proc/device-tree/cpus/PowerPC,POWER8@0/ibm,phandle 1000 > /proc/device-tree/cpus/l2-cache@2007/ibm,phandle 2007 > /proc/device-tree/cpus/l3-cache@3101/ibm,phandle 3101 > /proc/device-tree/pci@800201b/ibm,phandle 201b Some (the 1000) look like addresses as well. > > So the hash array has 64 entries out which only 8 are populated. 
Using > > hash_32() populates 29 entries. > On the G5 it's similarly inefficient: > [0.007379] OF: of_populate_phandle_cache(242) Used entries: 31, hashed: > 111 > And some output from a "real" pseries machine (IBM OF), which is > slightly better: > [0.129467] OF: of_populate_phandle_cache(242) Used entries: 39, hashed: 81 > So yeah using hash_32() is quite a bit better in both cases. Yup, no surprise there. And hash_32 is very cheap to compute. > And if I'm reading your patch right it would be a single line change to > switch, so that seems like it's worth doing to me. Agreed! Btw. Some OFs mangle the phandles some way, to make it easier to catch people using it as an address (and similarly, mangle ihandles differently, so you catch confusion between ihandles and phandles as well). Like a simple xor, with some odd number preferably. You should assume *nothing* about phandles, they are opaque identifiers. Segher
Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
On Mon, Dec 2, 2019 at 10:28 PM Frank Rowand wrote: > > On 12/2/19 10:12 PM, Michael Ellerman wrote: > > Frank Rowand writes: > >> On 11/29/19 9:10 AM, Sebastian Andrzej Siewior wrote: > >>> I've been looking at phandle_cache and noticed the following: The raw > >>> phandle value as generated by dtc starts at zero and is incremented by > >>> one for each phandle entry. The qemu pSeries model is using Slof (which > >>> is probably the same thing as used on real hardware) and this looks like > >>> a poiner value for the phandle. > >>> With > >>> qemu-system-ppc64le -m 16G -machine pseries -smp 8 > >>> > >>> I got the following output: > >>> | entries: 64 > >>> | phandle 7e732468 slot 28 hash c > >>> | phandle 7e732ad0 slot 10 hash 27 > >>> | phandle 7e732ee8 slot 28 hash 3a > >>> | phandle 7e734160 slot 20 hash 36 > >>> | phandle 7e734318 slot 18 hash 3a > >>> | phandle 7e734428 slot 28 hash 33 > >>> | phandle 7e734538 slot 38 hash 2c > >>> | phandle 7e734850 slot 10 hash e > >>> | phandle 7e735220 slot 20 hash 2d > >>> | phandle 7e735bf0 slot 30 hash d > >>> | phandle 7e7365c0 slot 0 hash 2d > >>> | phandle 7e736f90 slot 10 hash d > >>> | phandle 7e737960 slot 20 hash 2d > >>> | phandle 7e738330 slot 30 hash d > >>> | phandle 7e738d00 slot 0 hash 2d > >>> | phandle 7e739730 slot 30 hash 38 > >>> | phandle 7e73bd08 slot 8 hash 17 > >>> | phandle 7e73c2e0 slot 20 hash 32 > >>> | phandle 7e73c7f8 slot 38 hash 37 > >>> | phandle 7e782420 slot 20 hash 13 > >>> | phandle 7e782ed8 slot 18 hash 1b > >>> | phandle 7e73ce28 slot 28 hash 39 > >>> | phandle 7e73d390 slot 10 hash 22 > >>> | phandle 7e73d9a8 slot 28 hash 1a > >>> | phandle 7e73dc28 slot 28 hash 37 > >>> | phandle 7e73de00 slot 0 hash a > >>> | phandle 7e73e028 slot 28 hash 0 > >>> | phandle 7e7621a8 slot 28 hash 36 > >>> | phandle 7e73e458 slot 18 hash 1e > >>> | phandle 7e73e608 slot 8 hash 1e > >>> | phandle 7e740078 slot 38 hash 28 > >>> | phandle 7e740180 slot 0 hash 1d > >>> | phandle 7e740240 slot 0 hash 33 
> >>> | phandle 7e740348 slot 8 hash 29 > >>> | phandle 7e740410 slot 10 hash 2 > >>> | phandle 7e740eb0 slot 30 hash 3e > >>> | phandle 7e745390 slot 10 hash 33 > >>> | phandle 7e747b08 slot 8 hash c > >>> | phandle 7e748528 slot 28 hash f > >>> | phandle 7e74a6e0 slot 20 hash 18 > >>> | phandle 7e74aab0 slot 30 hash b > >>> | phandle 7e74f788 slot 8 hash d > >>> | Used entries: 8, hashed: 29 > >>> > >>> So the hash array has 64 entries out which only 8 are populated. Using > >>> hash_32() populates 29 entries. > >>> Could someone with real hardware verify this? > >>> I'm not sure how important this performance wise, it looks just like a > >>> waste using only 1/8 of the array. > >> > >> The hash used is based on the assumptions you noted, and as stated in the > >> code, that phandle property values are in a contiguous range of 1..n > >> (not starting from zero), which is what dtc generates. > > > >> We knew that for systems that do not match the assumptions that the hash > >> will not be optimal. > > > > If we're going to have the phandle cache it should at least make some > > attempt to work across the systems that Linux supports. > > > >> Unless there is a serious performance problem for > >> such systems, I do not want to make the phandle hash code more complicated > >> to optimize for these cases. And the pseries have been performing ok > >> without phandle related performance issues that I remember hearing since > >> before the cache was added, which could have only helped the performance. > >> Yes, if your observations are correct, some memory is being wasted, but > >> a 64 entry cache is not very large on a pseries. > > > > A single line change to use an actual hash function is hardly > > complicating it, compared to the amount of code already there. > > With a dtc generated devicetree, the hash is perfect, with no > misses. That single line change then makes the hash bad for > dtc generated devicetrees. 
To quantify that, I did some tests with the hash algo and it's about a 10% collision rate using phandle values of 1-$cache_size. There's never more than 2 phandles colliding in a slot. The actual impact would be less given 100% thrashing seems unlikely. > The cache was added to address a problem with a system with a > dtc generated devicetree. The cache was added to address a problem with a system with a large number of nodes and phandles (814 phandles in the reported case). dtc happens to be used in that case. > I had not heard of any phandle > lookup performance issues on ppc systems. An imperfect hash > for ppc should not make the ppc performance worse (unless > every single phandle value hashes to a single bucket). So > the ppc performance is likely better than it was before > the hash was added, even without an optimal hash algorithm. > > So the change would not be a single line change. It would > be a change to use different hash algorithms for dtc > generated device trees versus other device trees. > > Another
Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
On 12/2/19 10:12 PM, Michael Ellerman wrote: > Frank Rowand writes: >> On 11/29/19 9:10 AM, Sebastian Andrzej Siewior wrote: >>> I've been looking at phandle_cache and noticed the following: The raw >>> phandle value as generated by dtc starts at zero and is incremented by >>> one for each phandle entry. The qemu pSeries model is using Slof (which >>> is probably the same thing as used on real hardware) and this looks like >>> a poiner value for the phandle. >>> With >>> qemu-system-ppc64le -m 16G -machine pseries -smp 8 >>> >>> I got the following output: >>> | entries: 64 >>> | phandle 7e732468 slot 28 hash c >>> | phandle 7e732ad0 slot 10 hash 27 >>> | phandle 7e732ee8 slot 28 hash 3a >>> | phandle 7e734160 slot 20 hash 36 >>> | phandle 7e734318 slot 18 hash 3a >>> | phandle 7e734428 slot 28 hash 33 >>> | phandle 7e734538 slot 38 hash 2c >>> | phandle 7e734850 slot 10 hash e >>> | phandle 7e735220 slot 20 hash 2d >>> | phandle 7e735bf0 slot 30 hash d >>> | phandle 7e7365c0 slot 0 hash 2d >>> | phandle 7e736f90 slot 10 hash d >>> | phandle 7e737960 slot 20 hash 2d >>> | phandle 7e738330 slot 30 hash d >>> | phandle 7e738d00 slot 0 hash 2d >>> | phandle 7e739730 slot 30 hash 38 >>> | phandle 7e73bd08 slot 8 hash 17 >>> | phandle 7e73c2e0 slot 20 hash 32 >>> | phandle 7e73c7f8 slot 38 hash 37 >>> | phandle 7e782420 slot 20 hash 13 >>> | phandle 7e782ed8 slot 18 hash 1b >>> | phandle 7e73ce28 slot 28 hash 39 >>> | phandle 7e73d390 slot 10 hash 22 >>> | phandle 7e73d9a8 slot 28 hash 1a >>> | phandle 7e73dc28 slot 28 hash 37 >>> | phandle 7e73de00 slot 0 hash a >>> | phandle 7e73e028 slot 28 hash 0 >>> | phandle 7e7621a8 slot 28 hash 36 >>> | phandle 7e73e458 slot 18 hash 1e >>> | phandle 7e73e608 slot 8 hash 1e >>> | phandle 7e740078 slot 38 hash 28 >>> | phandle 7e740180 slot 0 hash 1d >>> | phandle 7e740240 slot 0 hash 33 >>> | phandle 7e740348 slot 8 hash 29 >>> | phandle 7e740410 slot 10 hash 2 >>> | phandle 7e740eb0 slot 30 hash 3e >>> | phandle 7e745390 slot 10 
hash 33 >>> | phandle 7e747b08 slot 8 hash c >>> | phandle 7e748528 slot 28 hash f >>> | phandle 7e74a6e0 slot 20 hash 18 >>> | phandle 7e74aab0 slot 30 hash b >>> | phandle 7e74f788 slot 8 hash d >>> | Used entries: 8, hashed: 29 >>> >>> So the hash array has 64 entries out which only 8 are populated. Using >>> hash_32() populates 29 entries. >>> Could someone with real hardware verify this? >>> I'm not sure how important this performance wise, it looks just like a >>> waste using only 1/8 of the array. >> >> The hash used is based on the assumptions you noted, and as stated in the >> code, that phandle property values are in a contiguous range of 1..n >> (not starting from zero), which is what dtc generates. > >> We knew that for systems that do not match the assumptions that the hash >> will not be optimal. > > If we're going to have the phandle cache it should at least make some > attempt to work across the systems that Linux supports. > >> Unless there is a serious performance problem for >> such systems, I do not want to make the phandle hash code more complicated >> to optimize for these cases. And the pseries have been performing ok >> without phandle related performance issues that I remember hearing since >> before the cache was added, which could have only helped the performance. >> Yes, if your observations are correct, some memory is being wasted, but >> a 64 entry cache is not very large on a pseries. > > A single line change to use an actual hash function is hardly > complicating it, compared to the amount of code already there. With a dtc generated devicetree, the hash is perfect, with no misses. That single line change then makes the hash bad for dtc generated devicetrees. The cache was added to address a problem with a system with a dtc generated devicetree. I had not heard of any phandle lookup performance issues on ppc systems. 
An imperfect hash for ppc should not make the ppc performance worse (unless every single phandle value hashes to a single bucket). So the ppc performance is likely better than it was before the hash was added, even without an optimal hash algorithm. So the change would not be a single line change. It would be a change to use different hash algorithms for dtc generated device trees versus other device trees. Another possibility would be to make the cache be dependent upon not CONFIG_PPC. It might be possible to disable the cache with a minimal code change. > > cheers >
Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
Frank Rowand writes: > On 11/29/19 9:10 AM, Sebastian Andrzej Siewior wrote: >> I've been looking at phandle_cache and noticed the following: The raw >> phandle value as generated by dtc starts at zero and is incremented by >> one for each phandle entry. The qemu pSeries model is using Slof (which >> is probably the same thing as used on real hardware) and this looks like >> a poiner value for the phandle. >> With >> qemu-system-ppc64le -m 16G -machine pseries -smp 8 >> >> I got the following output: >> | entries: 64 >> | phandle 7e732468 slot 28 hash c >> | phandle 7e732ad0 slot 10 hash 27 >> | phandle 7e732ee8 slot 28 hash 3a >> | phandle 7e734160 slot 20 hash 36 >> | phandle 7e734318 slot 18 hash 3a >> | phandle 7e734428 slot 28 hash 33 >> | phandle 7e734538 slot 38 hash 2c >> | phandle 7e734850 slot 10 hash e >> | phandle 7e735220 slot 20 hash 2d >> | phandle 7e735bf0 slot 30 hash d >> | phandle 7e7365c0 slot 0 hash 2d >> | phandle 7e736f90 slot 10 hash d >> | phandle 7e737960 slot 20 hash 2d >> | phandle 7e738330 slot 30 hash d >> | phandle 7e738d00 slot 0 hash 2d >> | phandle 7e739730 slot 30 hash 38 >> | phandle 7e73bd08 slot 8 hash 17 >> | phandle 7e73c2e0 slot 20 hash 32 >> | phandle 7e73c7f8 slot 38 hash 37 >> | phandle 7e782420 slot 20 hash 13 >> | phandle 7e782ed8 slot 18 hash 1b >> | phandle 7e73ce28 slot 28 hash 39 >> | phandle 7e73d390 slot 10 hash 22 >> | phandle 7e73d9a8 slot 28 hash 1a >> | phandle 7e73dc28 slot 28 hash 37 >> | phandle 7e73de00 slot 0 hash a >> | phandle 7e73e028 slot 28 hash 0 >> | phandle 7e7621a8 slot 28 hash 36 >> | phandle 7e73e458 slot 18 hash 1e >> | phandle 7e73e608 slot 8 hash 1e >> | phandle 7e740078 slot 38 hash 28 >> | phandle 7e740180 slot 0 hash 1d >> | phandle 7e740240 slot 0 hash 33 >> | phandle 7e740348 slot 8 hash 29 >> | phandle 7e740410 slot 10 hash 2 >> | phandle 7e740eb0 slot 30 hash 3e >> | phandle 7e745390 slot 10 hash 33 >> | phandle 7e747b08 slot 8 hash c >> | phandle 7e748528 slot 28 hash f >> | phandle 
7e74a6e0 slot 20 hash 18 >> | phandle 7e74aab0 slot 30 hash b >> | phandle 7e74f788 slot 8 hash d >> | Used entries: 8, hashed: 29 >> >> So the hash array has 64 entries out which only 8 are populated. Using >> hash_32() populates 29 entries. >> Could someone with real hardware verify this? >> I'm not sure how important this performance wise, it looks just like a >> waste using only 1/8 of the array. > > The hash used is based on the assumptions you noted, and as stated in the > code, that phandle property values are in a contiguous range of 1..n > (not starting from zero), which is what dtc generates. > We knew that for systems that do not match the assumptions that the hash > will not be optimal. If we're going to have the phandle cache it should at least make some attempt to work across the systems that Linux supports. > Unless there is a serious performance problem for > such systems, I do not want to make the phandle hash code more complicated > to optimize for these cases. And the pseries have been performing ok > without phandle related performance issues that I remember hearing since > before the cache was added, which could have only helped the performance. > Yes, if your observations are correct, some memory is being wasted, but > a 64 entry cache is not very large on a pseries. A single line change to use an actual hash function is hardly complicating it, compared to the amount of code already there. cheers
Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
Sebastian Andrzej Siewior writes: > I've been looking at phandle_cache and noticed the following: The raw > phandle value as generated by dtc starts at zero and is incremented by > one for each phandle entry. The qemu pSeries model is using Slof (which > is probably the same thing as used on real hardware) and this looks like > a pointer value for the phandle. We don't use SLOF on bare metal these days. I've certainly heard it said that on some OF's the phandle was just == the address of the internal representation, and I guess maybe for SLOF that is true. They seem to vary wildly though, eg. on an Apple G5: $ find /proc/device-tree/ -name phandle | xargs lsprop | head -10 /proc/device-tree/vsp@0,f900/veo@f918/phandle ff970848 /proc/device-tree/vsp@0,f900/phandle ff970360 /proc/device-tree/vsp@0,f900/veo@f908/phandle ff970730 /proc/device-tree/nvram@0,fff04000/phandle ff967fb8 /proc/device-tree/xmodem/phandle ff9655e8 /proc/device-tree/multiboot/phandle ff9504f0 /proc/device-tree/diagnostics/phandle ff965550 /proc/device-tree/options/phandle ff893cf0 /proc/device-tree/openprom/client-services/phandle ff8925b8 /proc/device-tree/openprom/phandle ff892458 That machine does not have enough RAM for those to be 32-bit real addresses. I think Apple OF is running in virtual mode though (?), so maybe they are pointers? 
And on an IBM pseries machine they're a bit all over the place: /proc/device-tree/cpus/PowerPC,POWER8@40/ibm,phandle 1040 /proc/device-tree/cpus/l2-cache@2005/ibm,phandle 2005 /proc/device-tree/cpus/PowerPC,POWER8@30/ibm,phandle 1030 /proc/device-tree/cpus/PowerPC,POWER8@20/ibm,phandle 1020 /proc/device-tree/cpus/PowerPC,POWER8@10/ibm,phandle 1010 /proc/device-tree/cpus/l2-cache@2003/ibm,phandle 2003 /proc/device-tree/cpus/l2-cache@200a/ibm,phandle 200a /proc/device-tree/cpus/l3-cache@3108/ibm,phandle 3108 /proc/device-tree/cpus/l2-cache@2001/ibm,phandle 2001 /proc/device-tree/cpus/l3-cache@3106/ibm,phandle 3106 /proc/device-tree/cpus/ibm,phandle fff8 /proc/device-tree/cpus/l3-cache@3104/ibm,phandle 3104 /proc/device-tree/cpus/l2-cache@2008/ibm,phandle 2008 /proc/device-tree/cpus/l3-cache@3102/ibm,phandle 3102 /proc/device-tree/cpus/l2-cache@2006/ibm,phandle 2006 /proc/device-tree/cpus/l3-cache@3100/ibm,phandle 3100 /proc/device-tree/cpus/PowerPC,POWER8@8/ibm,phandle 1008 /proc/device-tree/cpus/l2-cache@2004/ibm,phandle 2004 /proc/device-tree/cpus/PowerPC,POWER8@48/ibm,phandle 1048 /proc/device-tree/cpus/PowerPC,POWER8@38/ibm,phandle 1038 /proc/device-tree/cpus/l2-cache@2002/ibm,phandle 2002 /proc/device-tree/cpus/PowerPC,POWER8@28/ibm,phandle 1028 /proc/device-tree/cpus/l3-cache@3107/ibm,phandle 3107 /proc/device-tree/cpus/PowerPC,POWER8@18/ibm,phandle 1018 /proc/device-tree/cpus/l2-cache@2000/ibm,phandle 2000 /proc/device-tree/cpus/l3-cache@3105/ibm,phandle 3105 /proc/device-tree/cpus/l3-cache@3103/ibm,phandle 3103 /proc/device-tree/cpus/l3-cache@310a/ibm,phandle 310a /proc/device-tree/cpus/PowerPC,POWER8@0/ibm,phandle 1000 /proc/device-tree/cpus/l2-cache@2007/ibm,phandle 2007 /proc/device-tree/cpus/l3-cache@3101/ibm,phandle 3101 /proc/device-tree/pci@800201b/ibm,phandle 201b > With > qemu-system-ppc64le -m 16G -machine pseries -smp 8 > > I got the following output: > | entries: 64 > | phandle 7e732468 slot 28 hash c > | phandle 7e732ad0 slot 10 hash 27 > | 
phandle 7e732ee8 slot 28 hash 3a > | phandle 7e734160 slot 20 hash 36 > | phandle 7e734318 slot 18 hash 3a > | phandle 7e734428 slot 28 hash 33 > | phandle 7e734538 slot 38 hash 2c > | phandle 7e734850 slot 10 hash e > | phandle 7e735220 slot 20 hash 2d > | phandle 7e735bf0 slot 30 hash d > | phandle 7e7365c0 slot 0 hash 2d > | phandle 7e736f90 slot 10 hash d > | phandle 7e737960 slot 20 hash 2d > | phandle 7e738330 slot 30 hash d > | phandle 7e738d00 slot 0 hash 2d > | phandle 7e739730 slot 30 hash 38 > | phandle 7e73bd08 slot 8 hash 17 > | phandle 7e73c2e0 slot 20 hash 32 > | phandle 7e73c7f8 slot 38 hash 37 > | phandle 7e782420 slot 20 hash 13 > | phandle 7e782ed8 slot 18 hash 1b > | phandle 7e73ce28 slot 28 hash 39 > | phandle 7e73d390 slot 10 hash 22 > | phandle 7e73d9a8 slot 28 hash 1a > | phandle 7e73dc28 slot 28 hash 37 > | phandle 7e73de00 slot 0 hash a > | phandle 7e73e028 slot 28 hash 0 > | phandle 7e7621a8 slot 28 hash 36 > | phandle 7e73e458 slot 18 hash 1e > | phandle 7e73e608 slot 8 hash 1e > | phandle 7e740078 slot 38 hash 28 > | phandle 7e740180 slot 0 hash 1d > | phandle 7e740240 slot 0 hash 33 > | phandle 7e740348 slot 8 hash 29 > | phandle 7e740410 slot 10 hash 2 > | phandle 7e740eb0 slot 30 hash 3e > | phandle 7e745390 slot 10 hash 33 > | phandle 7e747b08 slot 8 hash c > | phandle 7e748528 slot 28 hash f > | phandle 7e74a6e0 slot 20 hash 18 > | phandle 7e74aab0 slot 30
Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
On 2019-11-29 20:14:47 [-0600], Frank Rowand wrote: > The hash used is based on the assumptions you noted, and as stated in the > code, that phandle property values are in a contiguous range of 1..n > (not starting from zero), which is what dtc generates. > > We knew that for systems that do not match the assumptions that the hash > will not be optimal. Unless there is a serious performance problem for > such systems, I do not want to make the phandle hash code more complicated > to optimize for these cases. And the pseries have been performing ok > without phandle related performance issues that I remember hearing since > before the cache was added, which could have only helped the performance. > Yes, if your observations are correct, some memory is being wasted, but > a 64 entry cache is not very large on a pseries. okay, so it is nothing new and everyone is aware of the situation. I move on then :) > -Frank Sebastian
Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
On 11/29/19 9:10 AM, Sebastian Andrzej Siewior wrote:
> I've been looking at phandle_cache and noticed the following: the raw
> phandle value as generated by dtc starts at zero and is incremented by
> one for each phandle entry. The qemu pSeries model uses SLOF (which is
> probably the same thing as used on real hardware), and this looks like
> a pointer value for the phandle.
> With
>     qemu-system-ppc64le -m 16G -machine pseries -smp 8
> I got the following output:
> | entries: 64
> | phandle 7e732468 slot 28 hash c
> | phandle 7e732ad0 slot 10 hash 27
> | phandle 7e732ee8 slot 28 hash 3a
> | phandle 7e734160 slot 20 hash 36
> | phandle 7e734318 slot 18 hash 3a
> | phandle 7e734428 slot 28 hash 33
> | phandle 7e734538 slot 38 hash 2c
> | phandle 7e734850 slot 10 hash e
> | phandle 7e735220 slot 20 hash 2d
> | phandle 7e735bf0 slot 30 hash d
> | phandle 7e7365c0 slot 0 hash 2d
> | phandle 7e736f90 slot 10 hash d
> | phandle 7e737960 slot 20 hash 2d
> | phandle 7e738330 slot 30 hash d
> | phandle 7e738d00 slot 0 hash 2d
> | phandle 7e739730 slot 30 hash 38
> | phandle 7e73bd08 slot 8 hash 17
> | phandle 7e73c2e0 slot 20 hash 32
> | phandle 7e73c7f8 slot 38 hash 37
> | phandle 7e782420 slot 20 hash 13
> | phandle 7e782ed8 slot 18 hash 1b
> | phandle 7e73ce28 slot 28 hash 39
> | phandle 7e73d390 slot 10 hash 22
> | phandle 7e73d9a8 slot 28 hash 1a
> | phandle 7e73dc28 slot 28 hash 37
> | phandle 7e73de00 slot 0 hash a
> | phandle 7e73e028 slot 28 hash 0
> | phandle 7e7621a8 slot 28 hash 36
> | phandle 7e73e458 slot 18 hash 1e
> | phandle 7e73e608 slot 8 hash 1e
> | phandle 7e740078 slot 38 hash 28
> | phandle 7e740180 slot 0 hash 1d
> | phandle 7e740240 slot 0 hash 33
> | phandle 7e740348 slot 8 hash 29
> | phandle 7e740410 slot 10 hash 2
> | phandle 7e740eb0 slot 30 hash 3e
> | phandle 7e745390 slot 10 hash 33
> | phandle 7e747b08 slot 8 hash c
> | phandle 7e748528 slot 28 hash f
> | phandle 7e74a6e0 slot 20 hash 18
> | phandle 7e74aab0 slot 30 hash b
> | phandle 7e74f788 slot 8 hash d
> | Used entries: 8, hashed: 29
>
> So the hash array has 64 entries, of which only 8 are populated. Using
> hash_32() populates 29 entries.
> Could someone with real hardware verify this?
> I'm not sure how important this is performance-wise; it just looks like
> a waste to use only 1/8 of the array.

The hash used is based on the assumptions you noted, and, as stated in the
code, that phandle property values are in a contiguous range of 1..n (not
starting from zero), which is what dtc generates.

We knew that the hash will not be optimal for systems that do not match
these assumptions. Unless there is a serious performance problem for such
systems, I do not want to make the phandle hash code more complicated to
optimize for these cases. And pseries systems have been performing OK,
without phandle-related performance issues that I remember hearing about
since before the cache was added, which could only have helped the
performance.

Yes, if your observations are correct, some memory is being wasted, but a
64-entry cache is not very large on a pseries. There is already some push
back from Rob that the existing code is more complex than needed (e.g. the
variable cache size).
-Frank

> The patch used for testing:
>
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index 1d667eb730e19..2640d4bc81a9a 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -197,6 +197,7 @@ void of_populate_phandle_cache(void)
>  	u32 cache_entries;
>  	struct device_node *np;
>  	u32 phandles = 0;
> +	struct device_node **cache2;
>
>  	raw_spin_lock_irqsave(&devtree_lock, flags);
>
> @@ -214,14 +215,32 @@ void of_populate_phandle_cache(void)
>
>  	phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
>  				GFP_ATOMIC);
> +	cache2 = kcalloc(cache_entries, sizeof(*phandle_cache), GFP_ATOMIC);
>  	if (!phandle_cache)
>  		goto out;
>
> +	pr_err("%s(%d) entries: %d\n", __func__, __LINE__, cache_entries);
>  	for_each_of_allnodes(np)
>  		if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL) {
> +			int slot;
>  			of_node_get(np);
>  			phandle_cache[np->phandle & phandle_cache_mask] = np;
> +			slot = hash_32(np->phandle, __ffs(cache_entries));
> +			cache2[slot] = np;
> +			pr_err("%s(%d) phandle %x slot %x hash %x\n", __func__, __LINE__,
> +			       np->phandle, np->phandle & phandle_cache_mask, slot);
>  		}
> +	{
> +		int i, filled = 0, filled_hash = 0;
> +
> +		for (i = 0; i < cache_entries; i++) {
> +			if (phandle_cache[i])
> +				filled++;
> +			if (cache2[i])
> +				filled_hash++;
> +		}
> +		pr_err("%s(%d) Used entries: %d, hashed: %d\n", __func__,
[RFC] Efficiency of the phandle_cache on ppc64/SLOF
I've been looking at phandle_cache and noticed the following: the raw
phandle value as generated by dtc starts at zero and is incremented by one
for each phandle entry. The qemu pSeries model uses SLOF (which is
probably the same thing as used on real hardware), and this looks like a
pointer value for the phandle.

With

    qemu-system-ppc64le -m 16G -machine pseries -smp 8

I got the following output:

| entries: 64
| phandle 7e732468 slot 28 hash c
| phandle 7e732ad0 slot 10 hash 27
| phandle 7e732ee8 slot 28 hash 3a
| phandle 7e734160 slot 20 hash 36
| phandle 7e734318 slot 18 hash 3a
| phandle 7e734428 slot 28 hash 33
| phandle 7e734538 slot 38 hash 2c
| phandle 7e734850 slot 10 hash e
| phandle 7e735220 slot 20 hash 2d
| phandle 7e735bf0 slot 30 hash d
| phandle 7e7365c0 slot 0 hash 2d
| phandle 7e736f90 slot 10 hash d
| phandle 7e737960 slot 20 hash 2d
| phandle 7e738330 slot 30 hash d
| phandle 7e738d00 slot 0 hash 2d
| phandle 7e739730 slot 30 hash 38
| phandle 7e73bd08 slot 8 hash 17
| phandle 7e73c2e0 slot 20 hash 32
| phandle 7e73c7f8 slot 38 hash 37
| phandle 7e782420 slot 20 hash 13
| phandle 7e782ed8 slot 18 hash 1b
| phandle 7e73ce28 slot 28 hash 39
| phandle 7e73d390 slot 10 hash 22
| phandle 7e73d9a8 slot 28 hash 1a
| phandle 7e73dc28 slot 28 hash 37
| phandle 7e73de00 slot 0 hash a
| phandle 7e73e028 slot 28 hash 0
| phandle 7e7621a8 slot 28 hash 36
| phandle 7e73e458 slot 18 hash 1e
| phandle 7e73e608 slot 8 hash 1e
| phandle 7e740078 slot 38 hash 28
| phandle 7e740180 slot 0 hash 1d
| phandle 7e740240 slot 0 hash 33
| phandle 7e740348 slot 8 hash 29
| phandle 7e740410 slot 10 hash 2
| phandle 7e740eb0 slot 30 hash 3e
| phandle 7e745390 slot 10 hash 33
| phandle 7e747b08 slot 8 hash c
| phandle 7e748528 slot 28 hash f
| phandle 7e74a6e0 slot 20 hash 18
| phandle 7e74aab0 slot 30 hash b
| phandle 7e74f788 slot 8 hash d
| Used entries: 8, hashed: 29

So the hash array has 64 entries, of which only 8 are populated. Using
hash_32() populates 29 entries.
Could someone with real hardware verify this?
I'm not sure how important this is performance-wise; it just looks like a
waste to use only 1/8 of the array.

The patch used for testing:

diff --git a/drivers/of/base.c b/drivers/of/base.c
index 1d667eb730e19..2640d4bc81a9a 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -197,6 +197,7 @@ void of_populate_phandle_cache(void)
 	u32 cache_entries;
 	struct device_node *np;
 	u32 phandles = 0;
+	struct device_node **cache2;
 
 	raw_spin_lock_irqsave(&devtree_lock, flags);
 
@@ -214,14 +215,32 @@ void of_populate_phandle_cache(void)
 
 	phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
 				GFP_ATOMIC);
+	cache2 = kcalloc(cache_entries, sizeof(*phandle_cache), GFP_ATOMIC);
 	if (!phandle_cache)
 		goto out;
 
+	pr_err("%s(%d) entries: %d\n", __func__, __LINE__, cache_entries);
 	for_each_of_allnodes(np)
 		if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL) {
+			int slot;
 			of_node_get(np);
 			phandle_cache[np->phandle & phandle_cache_mask] = np;
+			slot = hash_32(np->phandle, __ffs(cache_entries));
+			cache2[slot] = np;
+			pr_err("%s(%d) phandle %x slot %x hash %x\n", __func__, __LINE__,
+			       np->phandle, np->phandle & phandle_cache_mask, slot);
 		}
+	{
+		int i, filled = 0, filled_hash = 0;
+
+		for (i = 0; i < cache_entries; i++) {
+			if (phandle_cache[i])
+				filled++;
+			if (cache2[i])
+				filled_hash++;
+		}
+		pr_err("%s(%d) Used entries: %d, hashed: %d\n", __func__,
+		       __LINE__, filled, filled_hash);
+	}
 out:
 	raw_spin_unlock_irqrestore(&devtree_lock, flags);

Sebastian