Re: String handling optimisation

2012-04-05 Thread Arne Goedeke
Do you have any statistics about what the hit rate of the identifier lookup cache is? How many identifier lookups actually use the binary search? On Thu, 29 Mar 2012, Per Hedbor () @ Pike (-) developers forum wrote: If it was not clear, before CRC32, almost 50% of share-string time was spent

Re: String handling optimisation

2012-04-01 Thread Martin Stjernholm, Roxen IS @ Pike developers forum
Even if we were to get rid of the global lock, a global hash table for strings probably wouldn't be a significant problem since it can be made lock free.

String handling optimisation

2012-04-01 Thread Martin Stjernholm, Roxen IS @ Pike developers forum
The problem is that there may be work patterns where there's a significant risk of getting very long identical strings, e.g. if a file is read and cached for some time and then read again from another part of the program, or if the same file is read concurrently by different threads. What's

String handling optimisation

2012-03-29 Thread Stephen R. van den Berg
Does anyone know how often in the code we actually depend on the fact that the same string will be at the same address in memory? Because I'm contemplating an optimisation which would involve making the string duplication avoidance opportunistic instead of mandatory. I.e. something along the

Re: String handling optimisation

2012-03-29 Thread Arne Goedeke
On Thu, 29 Mar 2012, Stephen R. van den Berg wrote: Because I'm contemplating an optimisation which would involve making the string duplication avoidance opportunistic instead of mandatory. I guess the point here is to skip the hashing in cases where the strings are large, come from the

Re: String handling optimisation

2012-03-29 Thread Stephen R. van den Berg
Arne Goedeke wrote: On Thu, 29 Mar 2012, Stephen R. van den Berg wrote: Because I'm contemplating an optimisation which would involve making the string duplication avoidance opportunistic instead of mandatory. I guess the point here is to skip the hashing in cases where the strings are large,

String handling optimisation

2012-03-29 Thread Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum
(i.e. they're not fully hashed all the time, to avoid the overhead of rehashing large strings repeatedly when juggling around lots of strings). Large strings are not fully hashed. The hash function will consider at most 72 characters. So strings longer than that will not take longer to hash
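To illustrate the point about bounded hashing, here is a minimal sketch in C of a hash that looks at no more than 72 bytes of its input. Only the 72-character cap comes from the thread. The sampling layout (first 36 and last 36 bytes), the FNV-1a mixing, and the names `bounded_hash`, `fnv1a_step` and `HASH_SAMPLE_MAX` are assumptions made for illustration and are not Pike's actual hash function.

```c
#include <stddef.h>
#include <stdint.h>

/* Cap taken from the thread; how the sampled bytes are chosen is assumed. */
#define HASH_SAMPLE_MAX 72

/* Run FNV-1a over n bytes, continuing from hash state h. */
static uint32_t fnv1a_step(uint32_t h, const unsigned char *p, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Hypothetical bounded hash: cost is O(min(len, 72)), not O(len). */
uint32_t bounded_hash(const char *s, size_t len)
{
    const unsigned char *p = (const unsigned char *)s;
    uint32_t h = 2166136261u;

    /* Mix in the length so strings of different sizes tend to differ. */
    h ^= (uint32_t)len;
    h *= 16777619u;

    if (len <= HASH_SAMPLE_MAX)
        return fnv1a_step(h, p, len);   /* short string: hash every byte */

    /* Long string: hash only the first and last 36 bytes. */
    size_t half = HASH_SAMPLE_MAX / 2;
    h = fnv1a_step(h, p, half);
    return fnv1a_step(h, p + len - half, half);
}
```

With a scheme like this, two long strings that differ only in the unsampled middle end up in the same bucket, so the cost moves from the hash to the full comparison on a hash hit.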

Re: String handling optimisation

2012-03-29 Thread Per Hedbor () @ Pike (-) developers forum
That's exactly what I'm asking... How many places are there where we explicitly depend on the fact that the address can be used to define uniqueness? All places where strings are compared. Say, a few thousand places in the code, probably? Most importantly: Mappings and multiset, identifiers
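The dependency described here can be sketched with a tiny intern table in C: if every distinct string value is stored exactly once, equality reduces to comparing addresses. The names `shared_str`, `intern_string`, `hash_bytes` and `NBUCKETS` are hypothetical and the layout is simplified; this is not Pike's actual shared-string implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 256

/* Hypothetical shared string: one heap object per distinct value. */
struct shared_str {
    struct shared_str *next;  /* bucket chain */
    size_t len;
    char data[];              /* len bytes plus a trailing NUL */
};

static struct shared_str *buckets[NBUCKETS];

/* Simple FNV-1a hash over all bytes (assumed; any hash would do here). */
static uint32_t hash_bytes(const char *s, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h;
}

/* Return the unique shared copy of s, creating it on first use.
 * Returns NULL only if allocation fails. */
const struct shared_str *intern_string(const char *s, size_t len)
{
    uint32_t b = hash_bytes(s, len) % NBUCKETS;

    /* Hash hit: a full comparison is still needed to confirm equality. */
    for (struct shared_str *e = buckets[b]; e; e = e->next)
        if (e->len == len && memcmp(e->data, s, len) == 0)
            return e;

    struct shared_str *e = malloc(sizeof *e + len + 1);
    if (!e)
        return NULL;
    e->len = len;
    memcpy(e->data, s, len);
    e->data[len] = '\0';
    e->next = buckets[b];
    buckets[b] = e;
    return e;
}
```

Once all strings pass through something like `intern_string`, a mapping lookup can compare keys with `a == b` rather than `memcmp`. Making deduplication opportunistic would invalidate every such comparison.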

Re: String handling optimisation

2012-03-29 Thread Jonas Walldén @ Pike developers forum
The issue isn't necessarily the hashing but the fact that you need to have this globally synced instead of e.g. creating a thread-local string pool. Still, I agree with you that fundamental properties of mappings etc are based on string uniqueness. There are other low-hanging fruit that should be

Re: String handling optimisation

2012-03-29 Thread Per Hedbor () @ Pike (-) developers forum
The issue isn't necessarily the hashing but the fact that you need to have this globally synced instead of e.g. creating a thread-local string pool. Well. Yes, but as long as we do not have actual threads that can run concurrently in pike this is not much of an issue, really. C-code can (and

String handling optimisation

2012-03-29 Thread Henrik Grubbström (Lysator) @ Pike (-) developers forum
Does anyone know how often in the code we actually depend on the fact that the same string will be at the same address in memory? Often, but it's probably not hard to find a set of gatekeeper functions that cover all the cases. Because I'm contemplating an optimisation which would involve making

String handling optimisation

2012-03-29 Thread Per Hedbor () @ Pike (-) developers forum
If it was not clear, before CRC32, almost 50% of share-string time was spent in the hash function. But it was still a very small percentage of the total CPU time used. I optimized it because it was easy to do, mostly. We spend significantly more time looking up identifiers in objects, as an

String handling optimisation

2012-03-29 Thread Per Hedbor () @ Pike (-) developers forum
Do you have some statistics on this? I'd imagine that most time spent on comparing hash-hits would be on short to medium length strings, and not really long ones since it's unlikely you'll find another string with the exact same length in the hash bucket. I do have some statistics from the

String handling optimisation

2012-03-29 Thread Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum
Note that it typically isn't the calculation of the hash that is expensive, but the comparison on hash-hit. Do you have some statistics on this? I'd imagine that most time spent on comparing hash-hits would be on short to medium length strings, and not really long ones since it's unlikely