Not to dispute your point, but more to clarify mine. Mostly I wanted to make the minor note that the length test prevents most char-by-char comparison (assuming intern or some other canonicalization takes care of equality, as in the rest of the discussion).

Hash code was an afterthought, which came to mind since I had recently been researching string canonicalization alternatives to intern (e.g. via a HashSet). I was only suggesting hashCode if *repeated* char-by-char comparison of unequal strings is causing performance problems (same-length strings with a shared prefix being the most obvious case; by the sound of it, SAML may actually make this relevant here). The part I apparently didn't emphasize enough is that, yes, it only offers an advantage if the strings are used repeatedly in problem comparisons (or hashCode has already been used): .hashCode is calculated lazily, so it will only be calculated once per string (except in the unlikely case where the hash code matches the sentinel value 0), and for repeated use over a restricted set of strings the overhead can be amortized. intern internally calculates a hash code for strings, but AFAIK this is not currently used to preset the cached code, so there will be a one-off hash calculation and the associated cache churn; non-intern canonicalization using a hash table will already have cached the result, so you get it for "free".
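
To make the hash-table canonicalization idea concrete, here is a minimal sketch of what I mean. The class name StringPool and the method canonical are made up for illustration, it is not thread-safe, and it assumes a JRE recent enough to have Map.putIfAbsent (Java 8 or later). The map lookup calls hashCode() on the string, so the lazily computed value is already cached on whichever instance ends up in the pool.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: a minimal, non-thread-safe alternative to String.intern().
final class StringPool {
    private final Map<String, String> pool = new HashMap<>();

    // Returns the canonical instance equal to s, registering s if none exists yet.
    // The map lookup computes s.hashCode(), which String caches after first use.
    String canonical(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    public static void main(String[] args) {
        StringPool p = new StringPool();
        // new String(...) forces two distinct objects with equal content.
        String a = p.canonical(new String("http://www.w3.org/2000/09/xmldsig#"));
        String b = p.canonical(new String("http://www.w3.org/2000/09/xmldsig#"));
        System.out.println(a == b); // both calls return the same pooled instance
    }
}
```

Strings that have all passed through the same pool can be compared with == among themselves, but anything that bypassed the pool still needs .equals.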


Raul Benito wrote:
As the original author of the changes from equals to == in intern namespaces,
I can tell you that originally, on 1.4 and 1.5 and with my data (the
verification of a SAML/Liberty AuthnReq in a multi-threaded test, with the old
Juice JCE provider), the change was 10% to 20% faster.
SAML is one of the real examples of signing, and it has some URLs with common
prefixes and the same length.
The Juice provider also helps to get rid of the signing/digest cost (a
verification is two c14n: one of the signed part and one of the signature),
but I think just a c14n is a good way to measure it.
Also take into account that the == vs equals debate is more a memory
workload cache problem: if we have to iterate over and over every char just
to see that the strings are not equal, we thrash the cache. (That's why I used
multiple threads, to simulate a server decoding requests with more or less the
same code, but at different times and under different "workload".)
Nevertheless, if you have tested with a more modern JRE and the .equals code
is behaving better, just go ahead and kiss goodbye to the ==.

Clive, using .hashCode for strings in this case is not a big speed-up, as
it is going to go through all the chars of the string, thrashing the cache
again, and multiplying and adding the result into an integer, instead of
failing at the first different char or just summarizing to a boolean.

Regards,


On Tue, Aug 10, 2010 at 2:37 AM, Clive Brettingham-Moore <
xml...@brettingham-moore.net> wrote:

Have to agree .equals is the way to go, since correctness of == is too
reliant on what must be considered implementation optimisations in the
parser.

Benchmarking in JVM is notoriously difficult, but it does look like
there is no gross difference, which should kill any objections to doing
it correctly.

Since I recently spent far too long researching this for an unrelated
problem, I'll add my 10c to the detailed discussion.

On 10/08/10 01:23, Chad La Joie wrote:

Not necessarily, there are a number of not equal checks in there that
should, in theory, perform better if you use == only.  In such a
case, the use of != will just be a single check while !equals() will
result in a char-by-char comparison.

Actually, the next thing String.equals tests is length equality - so
character comparison will only be reached if the strings are the same
length.

Since the char-by-char comparison returns on the first mismatch,
only same-length strings with shared prefixes will show the expected
slowness (namespace URIs are likely to share prefixes, but I think are
not particularly likely to be the same length, unless actually equal)...
thus String.equals is only likely to be slow when comparing long
strings that are equal but distinct objects (so intern, or the
alternative string pooling techniques needed for ==, also benefits
.equals, without all the nasty loopholes: even if .equals is
occasionally slow, at least it is always right).
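
As a rough illustration of the check order described above, something like the following. This is a sketch written for this note, not the JDK source; the names EqualsSketch and sketchEquals are made up, and the null handling is an addition that the real instance method does not need for its receiver.

```java
// Illustrative re-implementation of the order of checks; not the JDK source.
final class EqualsSketch {
    static boolean sketchEquals(String a, String b) {
        if (a == b) return true;                     // identity: the cheap path that pooling enables
        if (a == null || b == null) return false;    // added for the static helper only
        if (a.length() != b.length()) return false;  // length test: no characters are read
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) return false; // stop at the first mismatch
        }
        return true; // reached only by equal content held in distinct objects
    }

    public static void main(String[] args) {
        System.out.println(sketchEquals("urn:example:a", "urn:example:ab")); // different lengths
        System.out.println(sketchEquals("urn:example:a", "urn:example:b"));  // same length, shared prefix
    }
}
```

Only the second call in main walks the characters, and it does so for the whole shared prefix before failing on the last one.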

In circumstances where you are doing repeated tests with many length and
prefix matches, adding a hash code inequality test ((s1.hashCode()==
s2.hashCode())&&s1.equals(s2)) could prevent practically all
char-by-char checks for !equal cases (but if the same strings are never
used repeatedly, the hash code calculation could be an issue; nb intern
results in hash calculation for all strings anyway)... pooling is still
needed to speed up matches for equality though.
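
Spelled out as a helper, the guard might look like the sketch below. HashGuard and guardedEquals are hypothetical names, and both arguments are assumed to be non-null. It only pays off when the strings are compared repeatedly, since the first hashCode() call on each string still reads every character.

```java
// Hypothetical helper wrapping the hash code inequality test from the text.
final class HashGuard {
    // Assumes s1 and s2 are non-null. String caches its hash after the first
    // call, so on repeated use the left-hand test is two field reads and a compare.
    static boolean guardedEquals(String s1, String s2) {
        return s1.hashCode() == s2.hashCode() && s1.equals(s2);
    }

    public static void main(String[] args) {
        // Same length and a long shared prefix: the slow case for plain equals.
        String a = "http://www.w3.org/2000/09/xmldsig#sha1";
        String b = "http://www.w3.org/2000/09/xmldsig#sha2";
        System.out.println(guardedEquals(a, b));
    }
}
```

The .equals call has to stay: equal hash codes do not imply equal strings, so the hash test can only rule out equality, never confirm it.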

Re VM options, I feel -server is definitely the right test bed,
both because of the more aggressive JIT and because the code is
likely to see its heaviest real-world use in -server VMs.



