Re: JDK-6982173: Small problem causing thousands of wasted CPU hours

2019-02-25 Thread Alan Snyder
> I think the original idea was that the "general contract" was that 
> collections contained elements whose membership was defined by equals(). 
> However, for a very long time now there have been collections in the JDK such 
> as SortedSet and IdentityHashMap (more precisely, IdentityHashMap's keyset). 
> I think we have to accept the fact that these are collections even though 
> their membership semantics differ from those provided by equals().

Perhaps we must accept their placement in the type hierarchy for compatibility 
reasons, but it seems reasonable to me to declare these exceptional classes to 
be nonstandard, which means that their behavior when used as a Collection might 
be unexpected.

An alternative, perhaps, would be to add methods that describe the nature of 
the membership, e.g. whether based on equals/hash or a comparator (or something 
else?), which would then become part of the general contract. You’ll probably 
call this a non-starter :-)
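
For illustration, such a self-describing method might look roughly like this (hypothetical names; nothing like this exists in the JDK):

    // Hypothetical sketch; not part of java.util.
    enum MembershipKind { EQUALS_AND_HASH, COMPARATOR, IDENTITY, OTHER }

    interface DescribesMembership {
        // A collection would report which membership test it actually uses,
        // and that report would become part of the general contract.
        MembershipKind membershipKind();
    }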

> After these are fixed, the specification and implementation will at least be 
> self-consistent. However, it will still be the case that HashSet, TreeSet, 
> and IdentityHashMap's keyset will have different membership semantics, and if 
> you pass them around interchangeably you might get unexpected results. I 
> think we just have to live with that.


I’m not sure what you mean by interchangeably. A TreeSet could misbehave when 
passed to a method on another TreeSet if that method assumes the “general contract” 
is satisfied by its parameter. Or are you adding a special-case requirement that every 
method on a Collection that takes a Collection parameter must behave properly 
when the actual argument is an instance of the same class, or something like 
that?

> Once we've admitted to the fact that different collections can have different 
> membership semantics, the main point of this bug (JDK-6982173) is to fix up 
> the semantics of AbstractSet.removeAll() to remove the semantic shifting that 
> can occur based on the relative sizes of the collections.

If we declare that collections can have different membership semantics and 
still satisfy the general contract of Collection, then we need to come up with 
a new specification of the general contract.

However, if we specify that membership based on anything other than equals/hash 
is nonstandard, then this “bug” is not a bug. It is a valid implementation on 
the assumption that the parameter satisfies the general contract of the 
parameter type. The behavior that you call a semantic shift, I would call an 
unexpected behavior resulting from using a nonstandard collection where a 
Collection was expected. I also believe that the performance penalty of fixing 
this bug is “astonishing”. One could argue whether the performance penalty is 
more or less astonishing than the unexpected behavior when a non-standard 
collection is used as the parameter.

> Introducing a new interface type is basically a non-starter.


Hopefully, what you mean is that introducing a new interface type and 
incompatibly rearranging the type hierarchy is a non-starter.

I still think it would be useful to:

1. introduce a Membership interface

2. have Collection extend Membership

3. add a method to Collection: removeAllMembers(Membership)
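
As a rough, hypothetical sketch of those three points (Membership and removeAllMembers come from the list above; MemberCollection and containsMember are invented names, and none of this exists in the JDK):

    import java.util.Collection;
    import java.util.Iterator;

    // Hypothetical: a type exposing only the membership test.
    interface Membership<E> {
        boolean containsMember(Object o);
    }

    // Hypothetical: if Collection extended Membership, removeAllMembers(Membership)
    // could be specified to use the argument's membership test explicitly.
    interface MemberCollection<E> extends Collection<E>, Membership<E> {
        @Override
        default boolean containsMember(Object o) {
            return contains(o);
        }

        default boolean removeAllMembers(Membership<? super E> m) {
            boolean modified = false;
            for (Iterator<E> it = iterator(); it.hasNext(); ) {
                if (m.containsMember(it.next())) {   // the argument decides membership
                    it.remove();
                    modified = true;
                }
            }
            return modified;
        }
    }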

> Objects of type Collection get passed around a lot, and users can usefully 
> rely on things like size(), they can iterate the Collection, etc.

Except when they are implementing something like removeAll()? They can iterate, 
but they cannot assume that they are getting all the objects that are members of the 
collection; instead, they are getting representative objects?

  Alan


> On Feb 25, 2019, at 11:06 AM, Stuart Marks  wrote:
> 
> 
> 
> On 2/15/19 11:30 AM, Alan Snyder wrote:
>> I think this situation is a mess.
>> The “general contract” of a Collection, I believe, is that it contains zero
>> or more identified member objects, as defined by the appropriate equals() method.
>> I know this is hard to specify precisely, but I presume we all “know it when
>> we see it”.
>> There is value to “collections” whose members are not objects but are
>> equivalence classes of objects, as defined by a nonstandard equality test,
>> but I think it is a mistake to call them Collections.
>> If a method takes a parameter of type Collection, it should be free to assume
>> that the parameter object supports the “general contract” of Collection. Is
>> there any plausible alternative?
> 
> [back from vacation]
> 
> Yes, this situation is something of a mess.
> 
> I think the original idea was that the "general contract" was that 
> collections contained elements whose membership was defined by equals(). 
> However, for a very long time now there have been collections in the JDK such 
> as SortedSet and IdentityHashMap (more precisely, IdentityHashMap's keyset). 
> I think we have to accept the fact that these are collections even though 
> their membership semantics differ from those provided by equals().
> 
>> Changing one 

Re: JDK-6982173: Small problem causing thousands of wasted CPU hours

2019-02-25 Thread Stuart Marks




On 2/15/19 11:30 AM, Alan Snyder wrote:

I think this situation is a mess.

The “general contract” of a Collection, I believe, is that it contains zero
or more identified member objects, as defined by the appropriate equals() method.
I know this is hard to specify precisely, but I presume we all “know it when
we see it”.

There is value to “collections” whose members are not objects but are
equivalence classes of objects, as defined by a nonstandard equality test,
but I think it is a mistake to call them Collections.

If a method takes a parameter of type Collection, it should be free to assume
that the parameter object supports the “general contract” of Collection. Is
there any plausible alternative?


[back from vacation]

Yes, this situation is something of a mess.

I think the original idea was that the "general contract" was that collections 
contained elements whose membership was defined by equals(). However, for a very 
long time now there have been collections in the JDK such as SortedSet and 
IdentityHashMap (more precisely, IdentityHashMap's keyset). I think we have to 
accept the fact that these are collections even though their membership 
semantics differ from those provided by equals().



Changing one method in the JDK to support a non-standard Collection parameter
does not solve the problem, because non-JDK methods with Collection (etc.)
parameters could have similar “misbehavior”. How would the developer know
when a specific TreeSet can or cannot be passed to a method? Does every
method that accepts a Collection (etc.) parameter require boilerplate (as in
the disjoint example) explaining its exact requirements or how it can go
wrong?

Perhaps it would be useful to define specific behaviors that these
nonstandard “collections” might support. For example, a Membership interface
(with method contains(e)) would be perfect for a removeAll(Membership) method
on Collection, implemented as you propose.

The wording on Collections.disjoint() is particularly clumsy. I'm not entirely 
sure what to do about it and its implementation; suggestions welcome. But I also 
don't think Collections.disjoint() is a very important problem, compared to 
what's going on in the rest of the framework.


Introducing a new interface type is basically a non-starter. Objects of type 
Collection get passed around a lot, and users can usefully rely on things like 
size(), they can iterate the Collection, etc.


Once we've admitted to the fact that different collections can have different 
membership semantics, the main point of this bug (JDK-6982173) is to fix up the 
semantics of AbstractSet.removeAll() to remove the semantic shifting that can 
occur based on the relative sizes of the collections.


There's another bug JDK-8190545 that needs to be fixed at some point, which is a 
larger scale cleanup to clarify and remove assumptions about collection 
membership. It starts out from the point that SortedSet and its ilk define 
membership using a Comparator; yet TreeSet.contains() still talks about using 
equals(), which is simply incorrect. Pulling on that thread for a while reveals 
a bunch of places in the spec that need adjustment. See the comments in the bug 
report and the linked email discussions.


After these are fixed, the specification and implementation will at least be 
self-consistent. However, it will still be the case that HashSet, TreeSet, and 
IdentityHashMap's keyset will have different membership semantics, and if you 
pass them around interchangeably you might get unexpected results. I think we 
just have to live with that.


s'marks


Re: JDK-6982173: Small problem causing thousands of wasted CPU hours

2019-02-15 Thread Alan Snyder
I think this situation is a mess.

The “general contract” of a Collection, I believe, is that it contains zero or 
more identified member objects, as defined by the appropriate equals() method. I 
know this is hard to specify precisely, but I presume we all “know it when we 
see it”.

There is value to “collections” whose members are not objects but are 
equivalence classes of objects, as defined by a nonstandard equality test, but 
I think it is a mistake to call them Collections.

If a method takes a parameter of type Collection, it should be free to assume 
that the parameter object supports the “general contract” of Collection. Is 
there any plausible alternative?

Changing one method in the JDK to support a non-standard Collection parameter 
does not solve the problem, because non-JDK methods with Collection (etc.) 
parameters could have similar “misbehavior”. How would the developer know when 
a specific TreeSet can or cannot be passed to a method? Does every method that 
accepts a Collection (etc.) parameter require boilerplate (as in the disjoint 
example) explaining its exact requirements or how it can go wrong?

Perhaps it would be useful to define specific behaviors that these nonstandard 
“collections” might support. For example, a Membership interface (with method 
contains(e)) would be perfect for a removeAll(Membership) method on Collection, 
implemented as you propose.

  Alan




> On Feb 15, 2019, at 10:44 AM, Stuart Marks  wrote:
> 
> 
> 
> On 2/14/19 9:30 AM, Alan Snyder wrote:
>>> Care must be exercised if this method is used on collections that
>>> do not comply with the general contract for {@code Collection}.
>> So, what does this mean? Are we catering to incorrect implementations?
> 
> I think this is a quote from the specification of Collections.disjoint():
> 
>> * Care must be exercised if this method is used on collections that
>> * do not comply with the general contract for {@code Collection}.
>> * Implementations may elect to iterate over either collection and test
>> * for containment in the other collection (or to perform any equivalent
>> * computation).  If either collection uses a nonstandard equality test
>> * (as does a {@link SortedSet} whose ordering is not compatible with
>> * equals, or the key set of an {@link IdentityHashMap}), both
>> * collections must use the same nonstandard equality test, or the
>> * result of this method is undefined.
> 
> (Collections.disjoint() has similar issues to removeAll/retainAll but I don't 
> think it's as important to fix, nor do I think it necessarily must be made 
> consistent with them. But that's kind of a separate discussion.)
> 
> I dislike the phrase "general contract" because it's, well, too general. I 
> think it mainly refers to what I've been referring to as "contains() 
> semantics" or "membership semantics". The rest of the paragraph refers to 
> cases where a "nonstandard equality test" is used. I prefer to say that the 
> membership semantics of things like IdentityHashMap and SortedSet differ from 
> the membership semantics specified in the Set interface.
> 
> Specifically, Set says
> 
>sets contain no pair of elements e1 and e2 such that e1.equals(e2)
> 
> whereas IdentityHashMap implies that it cannot contain any two keys k1 and k2 
> such that k1 == k2, and SortedSet should say that it contains no pair of 
> elements e1 and e2 such that compare(e1, e2) == 0.
> 
> (This last is the subject that JDK-8190545 is intended to address.)
> 
> The upshot is that Collections.disjoint might not work as expected if it's 
> used on collections that have different membership semantics.
> 
> s'marks
> 



Re: JDK-6982173: Small problem causing thousands of wasted CPU hours

2019-02-15 Thread Stuart Marks




On 2/14/19 9:30 AM, Alan Snyder wrote:

Care must be exercised if this method is used on collections that
do not comply with the general contract for {@code Collection}.


So, what does this mean? Are we catering to incorrect implementations?


I think this is a quote from the specification of Collections.disjoint():


 * Care must be exercised if this method is used on collections that
 * do not comply with the general contract for {@code Collection}.
 * Implementations may elect to iterate over either collection and test
 * for containment in the other collection (or to perform any equivalent
 * computation).  If either collection uses a nonstandard equality test
 * (as does a {@link SortedSet} whose ordering is not compatible with
 * equals, or the key set of an {@link IdentityHashMap}), both
 * collections must use the same nonstandard equality test, or the
 * result of this method is undefined.


(Collections.disjoint() has similar issues to removeAll/retainAll but I don't 
think it's as important to fix, nor do I think it necessarily must be made 
consistent with them. But that's kind of a separate discussion.)


I dislike the phrase "general contract" because it's, well, too general. I think 
it mainly refers to what I've been referring to as "contains() semantics" or 
"membership semantics". The rest of the paragraph refers to cases where a 
"nonstandard equality test" is used. I prefer to say that the membership 
semantics of things like IdentityHashMap and SortedSet differ from the 
membership semantics specified in the Set interface.


Specifically, Set says

sets contain no pair of elements e1 and e2 such that e1.equals(e2)

whereas IdentityHashMap implies that it cannot contain any two keys k1 and k2 
such that k1 == k2, and SortedSet should say that it contains no pair of 
elements e1 and e2 such that compare(e1, e2) == 0.


(This last is the subject that JDK-8190545 is intended to address.)

The upshot is that Collections.disjoint might not work as expected if it's used 
on collections that have different membership semantics.
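
For illustration, here is a small sketch of the kind of surprise this can produce (class name invented; only standard JDK APIs are used). The spec allows disjoint() to iterate either collection and test containment in the other, so the two calls below may disagree depending on which contains() the implementation consults:

    import java.util.*;

    public class DisjointSurprise {
        public static void main(String[] args) {
            // An identity-based set holding a String that is equal to, but not
            // the same object as, the literal "a".
            Set<String> identitySet = Collections.newSetFromMap(new IdentityHashMap<>());
            identitySet.add(new String("a"));

            Set<String> normalSet = new HashSet<>(List.of("a"));

            // May disagree, since each call may consult a different membership test.
            System.out.println(Collections.disjoint(identitySet, normalSet));
            System.out.println(Collections.disjoint(normalSet, identitySet));
        }
    }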


s'marks


Re: JDK-6982173: Small problem causing thousands of wasted CPU hours

2019-02-14 Thread Martin Buchholz
On Wed, Feb 13, 2019 at 6:45 PM Stuart Marks wrote:

>
> If we all agree on this, I'll proceed with working up a changeset for
> JDK-6394757.[3] It's time to fix this one.
>

Thanks for taking on the fixing of this unfixable problem.
It's important to do lots of documentation/guidance work, esp. a release
note.
Doug and I will take care of anything that needs doing  in jsr166.

In the real world, if performance is important, one might want to sort both
collections (or use collections that are naturally sorted), then do a
two-finger walk through both collections, funneling the results into a
target collection.
Which is N*log(N) + M*log(M) + N + M
With enough work, that might be library-fiable.
Hm ...
Imagine ConcurrentSkipListSet.removeAll(aNavigableSet) ...
Then by two-finger through both collections, advancing using
higher/ceiling, you should get something like O(min(N*log(N), M*log(M)))
If the argument is not sorted, then it is likely to be worth the effort of
copying it into a sorted set (TreeSet with same comparator?).
Software is hard.
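
For illustration, a rough sketch of such a two-finger walk over two sorted sets, advancing via ceiling()/higher() (class and method names invented; this is a sketch, not a tested library change, and it assumes both sets use the same ordering):

    import java.util.*;

    public class TwoFingerRemoveAll {
        // Removes from 'target' every element also present in 'toRemove'. Each
        // step advances one "finger" with ceiling()/higher(), so neither set is
        // scanned linearly.
        @SuppressWarnings("unchecked")
        static <E> boolean removeAllSorted(NavigableSet<E> target, NavigableSet<E> toRemove) {
            Comparator<? super E> cmp = target.comparator();
            boolean modified = false;
            E probe = toRemove.isEmpty() ? null : toRemove.first();
            while (probe != null) {
                E candidate = target.ceiling(probe);     // first element of target >= probe
                if (candidate == null) {
                    break;                               // nothing left in target to remove
                }
                int c = (cmp != null) ? cmp.compare(candidate, probe)
                                      : ((Comparable<? super E>) candidate).compareTo(probe);
                if (c == 0) {
                    target.remove(candidate);            // match: remove, advance the toRemove finger
                    modified = true;
                    probe = toRemove.higher(probe);
                } else {
                    probe = toRemove.ceiling(candidate); // gap: jump the toRemove finger forward
                }
            }
            return modified;
        }

        public static void main(String[] args) {
            NavigableSet<Integer> a = new TreeSet<>(List.of(1, 2, 3, 5, 8, 13));
            NavigableSet<Integer> b = new TreeSet<>(List.of(2, 3, 4, 8));
            removeAllSorted(a, b);
            System.out.println(a);   // [1, 5, 13]
        }
    }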


Re: JDK-6982173: Small problem causing thousands of wasted CPU hours

2019-02-14 Thread Alan Snyder
Right, I see that now.

Care must be exercised if this method is used on collections that
do not comply with the general contract for {@code Collection}.

So, what does this mean? Are we catering to incorrect implementations?



> On Feb 13, 2019, at 9:07 PM, Stuart Marks  wrote:
> 
> On 2/13/19 7:22 PM, Alan Snyder wrote:
>> If we take this route, how about changing the parameter type to Iterable?
> 
> Won't work. Where I've ended up is that we need to iterate over "this" 
> collection and, for each element, call contains() on the parameter. The 
> AbstractCollection.removeAll() implementation does something like this:
> 
>    removeAll(Collection<?> c) {
>        for (Iterator<?> it = iterator(); it.hasNext();) {
>            if (c.contains(it.next())) {
>                it.remove();
>            }
>        }
>    }
> 
> Since we call contains() on the parameter, it has to be a Collection.
> 
> s'marks
> 



Re: JDK-6982173: Small problem causing thousands of wasted CPU hours

2019-02-13 Thread Stuart Marks

On 2/13/19 7:22 PM, Alan Snyder wrote:

If we take this route, how about changing the parameter type to Iterable?


Won't work. Where I've ended up is that we need to iterate over "this" 
collection and, for each element, call contains() on the parameter. The 
AbstractCollection.removeAll() implementation does something like this:


removeAll(Collection<?> c) {
    for (Iterator<?> it = iterator(); it.hasNext();) {
        if (c.contains(it.next())) {
            it.remove();
        }
    }
}

Since we call contains() on the parameter, it has to be a Collection.

s'marks


Re: JDK-6982173: Small problem causing thousands of wasted CPU hours

2019-02-13 Thread Alan Snyder
If we take this route, how about changing the parameter type to Iterable?

  Alan



> On Feb 13, 2019, at 7:12 PM, Stuart Marks  wrote:
> 
> Right, as I mentioned in my earlier emails [1][2] this is related to 
> JDK-6394757 [3] where the semantics shift depending on the relative sizes of 
> the collections. This has a complicated history. In JDK-6982173 [4] there is 
> a lot of discussion about what heuristics to use for iterating this vs the 
> arg, in order to get the best performance. This assumes that these operations 
> are equivalent, when they aren't! In JDK-6394757 there's some discussion 
> about the dubious semantics of this optimization and the potential of simply 
> removing it.
> 
> At this point I think we need to remove the performance heuristic in order to 
> get consistent semantics. The performance won't be optimal -- indeed, it 
> wasn't before -- but at least it should be more predictable.
> 
> Your example is indeed counterintuitive, as it uses the contains() semantics 
> of the argument -- the string length -- instead of the contains() semantics 
> of "this". However, it's defensible, and it's supported by the spec, so I 
> don't really see any alternative.
> 
> The example set is pretty counterintuitive in the first place. For example:
> 
> jshell> Set<String> set = new TreeSet<>(Comparator.comparingInt(String::length))
> jshell> set.addAll(List.of("a", "if", "the", "when"))
> jshell> set
> set ==> [a, if, the, when]
> jshell> set.contains("foo")
> $235 ==> true
> 
> Also, this comparator isn't consistent with equals. This isn't a bug. But it 
> does give rise to even more counterintuitive behavior:
> 
> jshell> var set2 = Set.of("x", "xx", "xxx", "xxxx")
> jshell> set2.equals(set)
> $237 ==> false
> jshell> set.equals(set2)
> $238 ==> true
> 
> Fun with collections! :-)
> 
> s'marks
> 
> 
> [1] 
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-January/058172.html
> 
> [2] 
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-February/058539.html
> 
> [3] https://bugs.openjdk.java.net/browse/JDK-6394757
> 
> [4] https://bugs.openjdk.java.net/browse/JDK-6982173
> 
> 
> On 2/11/19 2:23 PM, Michael Rasmussen wrote:
>> The current implementation seems very counter-intuitive with anything that 
>> doesn't have the same comparison semantics, for instance a TreeSet/Map with 
>> a Comparator, Identity-based Set/Map etc.
>> The output of the following snippet would probably surprise most:
>> /* --- snip --- */
>> import java.util.*;
>> class Test {
>>   public static void main(String[] args) {
>> TreeMap<String, List<String>> lengthMap = new TreeMap<>(Comparator.comparingInt(String::length));
>> List<String> strings = new ArrayList<>(Arrays.asList("The quick brown fox jumps over the lazy dog".split(" ")));
>> for (String s : strings) {
>>   lengthMap.computeIfAbsent(s, k -> new ArrayList<>()).add(s);
>> }
>> Set<String> toRemove = lengthMap.keySet();
>> System.out.println("List before: " + strings);
>> System.out.println("Set to remove: " + toRemove);
>> strings.removeAll(toRemove);
>> System.out.println("List after:" + strings);
>>   }
>> }
>> /* --- snip --- */
>> List before: [The, quick, brown, fox, jumps, over, the, lazy, dog]
>> Set to remove: [The, over, quick]
>> List after:[]
>> /Michael
>> From: core-libs-dev  on behalf of 
>> Tagir Valeev 
>> Sent: 08 February 2019 15:13
>> To: Stuart Marks
>> Cc: core-libs-dev; Jan Peter Stotz
>> Subject: Re: JDK-6982173: Small problem causing thousands of wasted CPU hours
>>   Hello!
>>> I would argue that iterating the argument and calling remove() on "this" 
>>> are the
>>> right semantics, because you want set membership to be determined by this 
>>> set,
>>> not by whatever collection you pass as an argument. However, I note that
>>> AbstractCollection.removeAll and other removeAll implementations iterate 
>>> over
>>> "this" and do a contains() check on the argument. The revised
>>> AbstractSet.removeAll would be an outlier here, though it makes sense to me 
>>> to
>>> do it this way.
>> For complete picture it should be noted that there's a slight
>> difference in remove and removeAll spec: remove removes at most one
>> element while removeAll removes all elements from the specified
>> co

Re: JDK-6982173: Small problem causing thousands of wasted CPU hours

2019-02-13 Thread Stuart Marks
Right, as I mentioned in my earlier emails [1][2] this is related to JDK-6394757 
[3] where the semantics shift depending on the relative sizes of the 
collections. This has a complicated history. In JDK-6982173 [4] there is a lot 
of discussion about what heuristics to use for iterating this vs the arg, in 
order to get the best performance. This assumes that these operations are 
equivalent, when they aren't! In JDK-6394757 there's some discussion about the 
dubious semantics of this optimization and the potential of simply removing it.


At this point I think we need to remove the performance heuristic in order to get 
consistent semantics. The performance won't be optimal -- indeed, it wasn't 
before -- but at least it should be more predictable.


Your example is indeed counterintuitive, as it uses the contains() semantics of 
the argument -- the string length -- instead of the contains() semantics of 
"this". However, it's defensible, and it's supported by the spec, so I don't 
really see any alternative.


The example set is pretty counterintuitive in the first place. For example:

jshell> Set<String> set = new TreeSet<>(Comparator.comparingInt(String::length))
jshell> set.addAll(List.of("a", "if", "the", "when"))
jshell> set
set ==> [a, if, the, when]
jshell> set.contains("foo")
$235 ==> true

Also, this comparator isn't consistent with equals. This isn't a bug. But it 
does give rise to even more counterintuitive behavior:


jshell> var set2 = Set.of("x", "xx", "xxx", "xxxx")
jshell> set2.equals(set)
$237 ==> false
jshell> set.equals(set2)
$238 ==> true

Fun with collections! :-)

s'marks


[1] 
http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-January/058172.html

[2] 
http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-February/058539.html

[3] https://bugs.openjdk.java.net/browse/JDK-6394757

[4] https://bugs.openjdk.java.net/browse/JDK-6982173


On 2/11/19 2:23 PM, Michael Rasmussen wrote:

The current implementation seems very counter-intuitive with anything that 
doesn't have the same comparison semantics, for instance a TreeSet/Map with a 
Comparator, Identity-based Set/Map etc.

The output of the following snippet would probably surprise most:
/* --- snip --- */
import java.util.*;
class Test {
   public static void main(String[] args) {
 TreeMap<String, List<String>> lengthMap = new TreeMap<>(Comparator.comparingInt(String::length));
 List<String> strings = new ArrayList<>(Arrays.asList("The quick brown fox jumps over the lazy dog".split(" ")));
 for (String s : strings) {
   lengthMap.computeIfAbsent(s, k -> new ArrayList<>()).add(s);
 }

 Set<String> toRemove = lengthMap.keySet();

 System.out.println("List before: " + strings);
 System.out.println("Set to remove: " + toRemove);

 strings.removeAll(toRemove);
 System.out.println("List after:" + strings);
   }
}
/* --- snip --- */
List before: [The, quick, brown, fox, jumps, over, the, lazy, dog]
Set to remove: [The, over, quick]
List after:[]


/Michael


From: core-libs-dev  on behalf of Tagir 
Valeev 
Sent: 08 February 2019 15:13
To: Stuart Marks
Cc: core-libs-dev; Jan Peter Stotz
Subject: Re: JDK-6982173: Small problem causing thousands of wasted CPU hours
   


Hello!


I would argue that iterating the argument and calling remove() on "this" are the
right semantics, because you want set membership to be determined by this set,
not by whatever collection you pass as an argument. However, I note that
AbstractCollection.removeAll and other removeAll implementations iterate over
"this" and do a contains() check on the argument. The revised
AbstractSet.removeAll would be an outlier here, though it makes sense to me to
do it this way.


For complete picture it should be noted that there's a slight
difference in remove and removeAll spec: remove removes at most one
element while removeAll removes all elements from the specified
collection.
E.g. c.removeAll(Collections.singleton(foo)) would remove all
instances of foo from c while c.remove(foo) would return only one foo.
These should be equivalent for Set where repeating elements should not
normally appear, but it's wrong for any Collection. That's why
AbstractCollection.removeAll
cannot be rewritten in the same way.

Another interesting thing is
java.util.IdentityHashMap.KeySet#removeAll implementation [1]:
it reimplements the AbstractCollection#removeAll, because of the same
reason: now
the first branch of AbstractSet#removeAll becomes incorrect. See the
comment before it:

     /*
  * Must revert from AbstractSet's impl to AbstractCollection's, as
  * the former contains an optimization that results in incorrect
  * behavior when c is a smaller "normal" (non-identity-based) Set.
  */

Btw thi

Re: JDK-6982173: Small problem causing thousands of wasted CPU hours

2019-02-13 Thread Stuart Marks




On 2/8/19 5:13 AM, Tagir Valeev wrote:

I would argue that iterating the argument and calling remove() on "this" are the
right semantics, because you want set membership to be determined by this set,
not by whatever collection you pass as an argument. However, I note that
AbstractCollection.removeAll and other removeAll implementations iterate over
"this" and do a contains() check on the argument. The revised
AbstractSet.removeAll would be an outlier here, though it makes sense to me to
do it this way.


For complete picture it should be noted that there's a slight
difference in remove and removeAll spec: remove removes at most one
element while removeAll removes all elements from the specified
collection.
E.g. c.removeAll(Collections.singleton(foo)) would remove all
instances of foo from c while c.remove(foo) would return only one foo.
These should be equivalent for Set where repeating elements should not
normally appear, but it's wrong for any Collection. That's why
AbstractCollection.removeAll
cannot be rewritten in the same way.


Yes, this is a good point. This increases my concern over my earlier proposal 
for AbstractSet.removeAll being an "outlier".


The main issue here is whose semantics for containership are used. I had 
proposed iterating "arg" and calling this.remove() because that makes sense in 
isolation. However, AbstractCollection iterates "this" and calls arg.contains(). 
As you note, the specification implies that it must be done this way. This 
conflicts with my AbstractSet proposal.


In addition, there is the matter of retainAll(). You can't iterate "arg" and 
call this.remove(), since you need to remove elements that *aren't* present in 
"arg". At least, you can't without building an intermediate data structure, 
which seems out of bounds here. So retainAll() pretty much needs to iterate 
"this" and call arg.contains().


Based on this, I've changed my mind. AbstractSet.removeAll() really should do 
the same thing as AbstractCollection.removeAll(), namely to iterate "this" and 
call arg.contains(). In fact, AbstractSet can simply inherit the 
AbstractCollection implementation, and its specification adjusted accordingly.


Unfortunately, this means that code like the following:

someSet.removeAll(someList)

will have O(m*n) performance, since it will repeatedly call someList.contains(). 
This doesn't alleviate the performance concern. However, it's at least specified 
and predictable, so one can work around it if necessary.
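
One common workaround, when equals()/hashCode()-based matching is acceptable for the removal, is to copy the argument into a HashSet first so each contains() probe is cheap (someSet and someList as in the example above):

    // Average O(1) contains() per probe instead of a linear scan of the List.
    someSet.removeAll(new HashSet<>(someList));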


Also,

set1.removeAll(set2)

will unconditionally use set2's semantics for containership. This is a bit 
counterintuitive, as one might expect set1's semantics to be used. But it needs 
to be this way to be consistent with Collection.removeAll(), 
Collection.retainAll(), and Set.retainAll(). And at least it's predictable, and 
it doesn't vary based on the relative sizes of the sets!



Another interesting thing is
java.util.IdentityHashMap.KeySet#removeAll implementation [1]:
it reimplements the AbstractCollection#removeAll, because of the same
reason: now
the first branch of AbstractSet#removeAll becomes incorrect. See the
comment before it:

 /*
  * Must revert from AbstractSet's impl to AbstractCollection's, as
  * the former contains an optimization that results in incorrect
  * behavior when c is a smaller "normal" (non-identity-based) Set.
  */

Btw this comment should be updated to remove a "smaller" condition if
the proposed
change for AbstractSet will be implemented.

[1] 
http://hg.openjdk.java.net/jdk/jdk/file/e57bcfd7bf79/src/java.base/share/classes/java/util/IdentityHashMap.java#l990


Also a good point. In fact, with my current thinking, this comment and the 
override can be removed, since it can be inherited from AbstractCollection. 
(This also applies to EntrySet later in the same file.)


**

I also note that ConcurrentHashMap's CollectionView nested class [1] has a 
removeAll() implementation that uses a similar (but slightly different) 
heuristic for deciding whether to iterate "this" or "arg". I think this 
heuristic should be removed, since it suffers from the same semantic shift 
depending on relative sizes.


I further note that ConcurrentSkipListSet.removeAll() overrides 
AbstractSet.removeAll() in order to avoid an expensive call to size(). But it 
always iterates its arg and removes from this, which is backwards from 
everywhere else. (But at least it doesn't do any size-based "optimization".) I 
think this is simply a bug. I've filed JDK-8218945 [2] to cover this.


**

If we all agree on this, I'll proceed with working up a changeset for 
JDK-6394757.[3] It's time to fix this one.


s'marks


[1] 
http://hg.openjdk.java.net/jdk/jdk11/file/1ddf9a99e4ad/src/java.base/share/classes/java/util/concurrent/ConcurrentHashMap.java#l4538


[2] https://bugs.openjdk.java.net/browse/JDK-8218945

[3] https://bugs.openjdk.java.net/browse/JDK-6394757


Re: JDK-6982173: Small problem causing thousands of wasted CPU hours

2019-02-11 Thread Michael Rasmussen
The current implementation seems very counter-intuitive with anything that 
doesn't have the same comparison semantics, for instance a TreeSet/Map with a 
Comparator, Identity-based Set/Map etc.

The output of the following snippet would probably surprise most:
/* --- snip --- */
import java.util.*;
class Test {
  public static void main(String[] args) {
TreeMap<String, List<String>> lengthMap = new TreeMap<>(Comparator.comparingInt(String::length));
List<String> strings = new ArrayList<>(Arrays.asList("The quick brown fox jumps over the lazy dog".split(" ")));
for (String s : strings) {
  lengthMap.computeIfAbsent(s, k -> new ArrayList<>()).add(s);
}

Set<String> toRemove = lengthMap.keySet();
 
System.out.println("List before: " + strings);
System.out.println("Set to remove: " + toRemove);

strings.removeAll(toRemove);
System.out.println("List after:" + strings);
  }
}
/* --- snip --- */
List before: [The, quick, brown, fox, jumps, over, the, lazy, dog]
Set to remove: [The, over, quick]
List after:[]


/Michael


From: core-libs-dev  on behalf of Tagir 
Valeev 
Sent: 08 February 2019 15:13
To: Stuart Marks
Cc: core-libs-dev; Jan Peter Stotz
Subject: Re: JDK-6982173: Small problem causing thousands of wasted CPU hours
  

Hello!

> I would argue that iterating the argument and calling remove() on "this" are 
> the
> right semantics, because you want set membership to be determined by this set,
> not by whatever collection you pass as an argument. However, I note that
> AbstractCollection.removeAll and other removeAll implementations iterate over
> "this" and do a contains() check on the argument. The revised
> AbstractSet.removeAll would be an outlier here, though it makes sense to me to
> do it this way.

For complete picture it should be noted that there's a slight
difference in remove and removeAll spec: remove removes at most one
element while removeAll removes all elements from the specified
collection.
E.g. c.removeAll(Collections.singleton(foo)) would remove all
instances of foo from c while c.remove(foo) would return only one foo.
These should be equivalent for Set where repeating elements should not
normally appear, but it's wrong for any Collection. That's why
AbstractCollection.removeAll
cannot be rewritten in the same way.

Another interesting thing is
java.util.IdentityHashMap.KeySet#removeAll implementation [1]:
it reimplements the AbstractCollection#removeAll, because of the same
reason: now
the first branch of AbstractSet#removeAll becomes incorrect. See the
comment before it:

    /*
 * Must revert from AbstractSet's impl to AbstractCollection's, as
 * the former contains an optimization that results in incorrect
 * behavior when c is a smaller "normal" (non-identity-based) Set.
 */

Btw this comment should be updated to remove a "smaller" condition if
the proposed
change for AbstractSet will be implemented.

[1]  
http://hg.openjdk.java.net/jdk/jdk/file/e57bcfd7bf79/src/java.base/share/classes/java/util/IdentityHashMap.java#l990

With best regards,
Tagir Valeev


On Fri, Jan 25, 2019 at 7:11 AM Stuart Marks  wrote:
>
> On 1/23/19 12:05 PM, Jan Peter Stotz wrote:
> > like many others I ran into bug JDK-6982173 which is about 
> > AbstractSet.removeAll()
> > and its "aptitude" in wasting CPU time if you accidentally hit this bug. 
> > And
> > there are hundreds of developers hitting this bug every year.
> >
> > I simply don't understand why there was no progress in 8 years, on a severe
> > performance issue (a removeAll method on an efficient set that can require
> > O(n^2)!) caused by code that was written to speed-up the removeAll 
> > implementation.
> >
> > What makes this bug worse is that it is triggered by the relative size of 
> > the
> > current set compared to the collection passed as a parameter.
> > Therefore for most developers this means not to use this buggy function at 
> > all
> > (once they realize how bad it is).
>
> I wasn't aware that hundreds of developers are hitting this bug every year. I
> haven't seen any mention of it (besides in the bug database) on Twitter, on
> Reddit, on DZone, at the conferences I attend, or in several years of
> core-libs-dev emails. Well, it was mentioned on core-libs-dev once in 2011 [1]
> although that was a suggestion for improvement, not a complaint about 
> performance.
>
> Nonetheless this is a real problem, and if it's causing difficulties we can
> certainly take a look at it.
>
> There is, however, more to the story. The ACTUAL problem is a semantic one; 
> see
> JDK-6394757. [2] Briefly, consider x.removeAll(y). Depending on the relative
> sizes of x and y, this method might end up

Re: JDK-6982173: Small problem causing thousands of wasted CPU hours

2019-02-08 Thread Tagir Valeev
Hello!

> I would argue that iterating the argument and calling remove() on "this" are 
> the
> right semantics, because you want set membership to be determined by this set,
> not by whatever collection you pass as an argument. However, I note that
> AbstractCollection.removeAll and other removeAll implementations iterate over
> "this" and do a contains() check on the argument. The revised
> AbstractSet.removeAll would be an outlier here, though it makes sense to me to
> do it this way.

For complete picture it should be noted that there's a slight
difference in remove and removeAll spec: remove removes at most one
element while removeAll removes all elements from the specified
collection.
E.g. c.removeAll(Collections.singleton(foo)) would remove all
instances of foo from c while c.remove(foo) would return only one foo.
These should be equivalent for Set where repeating elements should not
normally appear, but it's wrong for any Collection. That's why
AbstractCollection.removeAll
cannot be rewritten in the same way.
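
A small example of that difference on a List, where duplicates are possible (class name invented; outputs shown in comments):

    import java.util.*;

    public class RemoveVsRemoveAll {
        public static void main(String[] args) {
            List<String> c1 = new ArrayList<>(List.of("foo", "bar", "foo"));
            c1.remove("foo");                              // removes only the first "foo"
            System.out.println(c1);                        // [bar, foo]

            List<String> c2 = new ArrayList<>(List.of("foo", "bar", "foo"));
            c2.removeAll(Collections.singleton("foo"));    // removes every "foo"
            System.out.println(c2);                        // [bar]
        }
    }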

Another interesting thing is
java.util.IdentityHashMap.KeySet#removeAll implementation [1]:
it reimplements the AbstractCollection#removeAll, because of the same
reason: now
the first branch of AbstractSet#removeAll becomes incorrect. See the
comment before it:

/*
 * Must revert from AbstractSet's impl to AbstractCollection's, as
 * the former contains an optimization that results in incorrect
 * behavior when c is a smaller "normal" (non-identity-based) Set.
 */

Btw this comment should be updated to remove a "smaller" condition if
the proposed
change for AbstractSet will be implemented.

[1] 
http://hg.openjdk.java.net/jdk/jdk/file/e57bcfd7bf79/src/java.base/share/classes/java/util/IdentityHashMap.java#l990

With best regards,
Tagir Valeev


On Fri, Jan 25, 2019 at 7:11 AM Stuart Marks  wrote:
>
> On 1/23/19 12:05 PM, Jan Peter Stotz wrote:
> > like many others I ran into bug JDK-6982173 which is about 
> > AbstractSet.removeAll()
> > and its "aptitude" in wasting CPU time if you accidentally hit this bug. 
> > And
> > there are hundreds of developers hitting this bug every year.
> >
> > I simply don't understand why there was no progress in 8 years, on a severe
> > performance issue (a removeAll method on an efficient set that can require
> > O(n^2)!) caused by code that was written to speed-up the removeAll 
> > implementation.
> >
> > What makes this bug worse is that it is triggered by the relative size of 
> > the
> > current set compared to the collection passed as a parameter.
> > Therefore for most developers this means not to use this buggy function at 
> > all
> > (once they realize how bad it is).
>
> I wasn't aware that hundreds of developers are hitting this bug every year. I
> haven't seen any mention of it (besides in the bug database) on Twitter, on
> Reddit, on DZone, at the conferences I attend, or in several years of
> core-libs-dev emails. Well, it was mentioned on core-libs-dev once in 2011 [1]
> although that was a suggestion for improvement, not a complaint about 
> performance.
>
> Nonetheless this is a real problem, and if it's causing difficulties we can
> certainly take a look at it.
>
> There is, however, more to the story. The ACTUAL problem is a semantic one; 
> see
> JDK-6394757. [2] Briefly, consider x.removeAll(y). Depending on the relative
> sizes of x and y, this method might end up using either x's or y's definition 
> of
> membership, which could differ from each other. (See the bug report for an
> example.) Thus the semantics of this method depend upon the relative sizes of
> the collections, which is arguably flawed.
>
> Worse, this behavior is specified to iterate this set or the argument, 
> depending
> upon their relative sizes. [3] So, fixing this will require an incompatible
> specification change.
>
> The obvious way to fix this is to get rid of the "optimizations" (that turn 
> out
> not to be optimizations at all in some cases) and replace it with a simple 
> loop:
>
>      public boolean removeAll(Collection<?> c) {
>          Objects.requireNonNull(c);
>          boolean modified = false;
>          for (Object e : c)
>              modified |= remove(e);
>          return modified;
>      }
>
> I would argue that iterating the argument and calling remove() on "this" are 
> the
> right semantics, because you want set membership to be determined by this set,
> not by whatever collection you pass as an argument. However, I note that
> AbstractCollection.removeAll and other removeAll implementations iterate over
> "this" and do a contains() check on the argument. The revised
> AbstractSet.removeAll would be an outlier here, though it makes sense to me to
> do it this way.
>
> Is it worth the incompatibility?
>
> s'marks
>
>
>
>
> [1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-July/007125.html
>
> [2] https://bugs.openjdk.java.net/browse/JDK-6394757
>
> [3]
> 

Re: JDK-6982173: Small problem causing thousands of wasted CPU hours

2019-02-02 Thread Jan Peter Stotz

Hi Stuart.

It looks like the web sites and services I use when developing and those 
you use are fundamentally different, or maybe we just use different 
search engines, because when I have programming problems I usually do not 
end up on Twitter or Reddit or DZone. And the existence of this mailing 
list seems to be hidden from the net, or its search-engine ranking is so bad 
that I can't remember ever having found a link to it in recent years (before 
knowing of its existence and explicitly searching for it).
However, when I search, the majority of links most likely point to 
Stackoverflow.com.


Searching Stackoverflow for something like "[Java] [performance] 
Hashset" you immediately get to the question 
https://stackoverflow.com/questions/28671903 which has 65 upvotes and a 
view count of 8,085. For a 5-year-old question this is IMHO pretty 
impressive.


Also, one has to keep in mind that we are talking about a performance 
problem that usually requires an experienced developer and large data 
sets before the delay becomes so apparent that you start searching for 
the reason.


Your proposed replacement for AbstractSet.removeAll() looks nice, and I 
agree with the arguments you pointed out about which 
collection is used as "this" and which as the argument. That should clearly 
be a decision of the developer.


That the change makes AbstractSet.removeAll "an outlier" is in my 
opinion only a problem from a very high-level perspective. Only people 
not aware of the differences between all the Collection implementations 
and the reason why they exist may expect to have a removeAll() 
implementation that works identically for every Collection implementation.


However, in my experience regular programmers rarely come to a point 
where they use Collection. Personally, for me Collection is more or less 
a synonym for Iterable, because iterating over a Collection is one of the 
few operations that works without major (potential) performance issues 
for all implementations.


Jan

Am 25.01.2019 um 01:07 schrieb Stuart Marks:
I wasn't aware that hundreds of developers are hitting this bug every 
year. I haven't seen any mention of it (besides in the bug database) on 
Twitter, on Reddit, on DZone, at the conferences I attend, or in several 
years of core-libs-dev emails. Well, it was mentioned on core-libs-dev 
once in 2011 [1] although that was a suggestion for improvement, not a 
complaint about performance.


Nonetheless this is a real problem, and if it's causing difficulties we 
can certainly take a look at it.


There is, however, more to the story. The ACTUAL problem is a semantic 
one; see JDK-6394757. [2] Briefly, consider x.removeAll(y). Depending on 
the relative sizes of x and y, this method might end up using either x's 
or y's definition of membership, which could differ from each other. 
(See the bug report for an example.) Thus the semantics of this method 
depend upon the relative sizes of the collections, which is arguably 
flawed.


Worse, this behavior is specified to iterate this set or the argument, 
depending upon their relative sizes. [3] So, fixing this will require an 
incompatible specification change.


The obvious way to fix this is to get rid of the "optimizations" (that 
turn out not to be optimizations at all in some cases) and replace it 
with a simple loop:


     public boolean removeAll(Collection<?> c) {
         Objects.requireNonNull(c);
         boolean modified = false;
         for (Object e : c)
             modified |= remove(e);
         return modified;
     }

I would argue that iterating the argument and calling remove() on "this" 
are the right semantics, because you want set membership to be 
determined by this set, not by whatever collection you pass as an 
argument. However, I note that AbstractCollection.removeAll and other 
removeAll implementations iterate over "this" and do a contains() check 
on the argument. The revised AbstractSet.removeAll would be an outlier 
here, though it makes sense to me to do it this way.


Is it worth the incompatibility?

s'marks




[1] 
http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-July/007125.html


[2] https://bugs.openjdk.java.net/browse/JDK-6394757

[3] 
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/AbstractSet.html#removeAll(java.util.Collection) 








Re: JDK-6982173: Small problem causing thousands of wasted CPU hours

2019-01-24 Thread Stuart Marks

On 1/23/19 12:05 PM, Jan Peter Stotz wrote:
like many others I ran into bug JDK-6982173 which is about AbstractSet.removeAll() 
and its "aptitude" in wasting CPU time if you accidentally hit this bug. And 
there are hundreds of developers hitting this bug every year.


I simply don't understand why there was no progress in 8 years, on a severe 
performance issue (a removeAll method on an efficient set that can require 
O(n^2)!) caused by code that was written to speed-up the removeAll implementation.


What makes this bug worse is that it is triggered by the relative size of the 
current set compared to the collection passed as a parameter.
Therefore for most developers this means not to use this buggy function at all 
(once they realize how bad it is).


I wasn't aware that hundreds of developers are hitting this bug every year. I 
haven't seen any mention of it (besides in the bug database) on Twitter, on 
Reddit, on DZone, at the conferences I attend, or in several years of 
core-libs-dev emails. Well, it was mentioned on core-libs-dev once in 2011 [1] 
although that was a suggestion for improvement, not a complaint about performance.


Nonetheless this is a real problem, and if it's causing difficulties we can 
certainly take a look at it.


There is, however, more to the story. The ACTUAL problem is a semantic one; see 
JDK-6394757. [2] Briefly, consider x.removeAll(y). Depending on the relative 
sizes of x and y, this method might end up using either x's or y's definition of 
membership, which could differ from each other. (See the bug report for an 
example.) Thus the semantics of this method depend upon the relative sizes of 
the collections, which is arguably flawed.


Worse, this behavior is specified to iterate this set or the argument, depending 
upon their relative sizes. [3] So, fixing this will require an incompatible 
specification change.


The obvious way to fix this is to get rid of the "optimizations" (that turn out 
not to be optimizations at all in some cases) and replace it with a simple loop:


public boolean removeAll(Collection<?> c) {
    Objects.requireNonNull(c);
    boolean modified = false;
    for (Object e : c)
        modified |= remove(e);
    return modified;
}

I would argue that iterating the argument and calling remove() on "this" are the 
right semantics, because you want set membership to be determined by this set, 
not by whatever collection you pass as an argument. However, I note that 
AbstractCollection.removeAll and other removeAll implementations iterate over 
"this" and do a contains() check on the argument. The revised 
AbstractSet.removeAll would be an outlier here, though it makes sense to me to 
do it this way.


Is it worth the incompatibility?

s'marks




[1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-July/007125.html

[2] https://bugs.openjdk.java.net/browse/JDK-6394757

[3] 
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/AbstractSet.html#removeAll(java.util.Collection)




Re: JDK-6982173: Small problem causing thousands of wasted CPU hours

2019-01-24 Thread Jan Peter Stotz

Hi.

Unfortunately, in my last message I made a mistake and mixed up the two 
sections. The main reasons were, first, too many coding hours 
yesterday, and second, that I had investigated this issue not yesterday but 
some weeks ago. During the last weeks I have searched for a way to participate in 
OpenJDK other than via a high-traffic mailing list (which is, IMHO, 
development as it was done in the last millennium). In particular, the bug 
tracker being useless (as it is only a bug viewer) is really not understandable to me.


Federico Peralta Schaffner wrote:

Now, how have you come to the conclusion that the problem is in the first
branch of the above if statement, and that it is related to the passed
collection's implementation details of the remove(Object o) method? I think
this is not the problem, as elements are removed from this set and not from
the passed collection.


You are right, the first section is not the problem but the second one 
is. I came to the same conclusion as you did when I first investigated 
this issue.


However, I would not call such a marker interface HashAccess, as that reflects a 
technology rather than an implementation characteristic. E.g. a TreeSet has an 
access time of O(log(n)), which may also be sufficiently fast for 
the second code part of AbstractSet.removeAll(), but its implementation 
is not based on hashing.


Therefore my suggestion is still to simply check whether the other collection 
is a Set, and only if it is one, use the second code part.
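
Roughly, that suggestion applied to the current AbstractSet.removeAll() shape might look like this (a sketch, not a tested patch):

    public boolean removeAll(Collection<?> c) {
        Objects.requireNonNull(c);
        boolean modified = false;
        if (c instanceof Set && size() <= c.size()) {
            // Only trust c.contains() to be fast when c is itself a Set.
            for (Iterator<?> i = iterator(); i.hasNext(); ) {
                if (c.contains(i.next())) {
                    i.remove();
                    modified = true;
                }
            }
        } else {
            // Otherwise iterate c and remove from this set.
            for (Object e : c)
                modified |= remove(e);
        }
        return modified;
    }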


Jan

PS: If somebody answers please cc me personally as I only use the digest 
mode of this list.


Re: JDK-6982173: Small problem causing thousands of wasted CPU hours

2019-01-23 Thread Federico Peralta Schaffner
Hi Jan,

I'm amazed to see this performance problem has persisted for so long. As a
Java developer, I wasn't even aware of it, so thanks for the information.

>
> The relevant part of the current implementation is as follows:
>
>      if (size() > c.size()) {
>          for (Object e : c)
>              modified |= remove(e);
>      } else {
>          for (Iterator<?> i = iterator(); i.hasNext(); ) {
>              if (c.contains(i.next())) {
>                  i.remove();
>                  modified = true;
>              }
>          }
>      }
>
> The problem is that the code in the first if block only makes sense if
> the collection passed as the parameter implements the remove(Object o)
> method in an efficient way - that is, with a complexity of less than
> O(n).

Now, how have you come to the conclusion that the problem is in the first
branch of the above if statement, and that it is related to the passed
collection's implementation details of the remove(Object o) method? I think
this is not the problem, as elements are removed from this set and not from
the passed collection.

In fact, the problem is in the else branch (when this set is smaller than
the passed collection). The problem is in the following line:

  if (c.contains(i.next())) {

That c.contains(...) might not be implemented in an optimized way, i.e. c
might be an ArrayList, which uses a linear search to check if the element
belongs to it.

> Java currently has no way to identify Collection implementations that
> satisfy this property (...)

Yes, you are correct. While there exists the RandomAccess marker interface
for List implementations, there is no analogous marker interface for
either Set or Map implementations that use hashing, i.e. something that
could have been named "HashAccess".

So I wonder if the members of this group find it plausible to add such a
HashAccess marker interface to HashSet and HashMap (and to any other
implementations that use hashing). If (as I suspect) introducing further
marker interfaces is not desirable (or has been tacitly forbidden), then
what alternatives are available and would be a good fit here?

For the record, if the HashAccess marker interface had ever existed, the
implementation could have been changed to:

  if (size() > c.size()) {
      for (Object e : c)
          modified |= remove(e);
  } else {
      Collection<?> cc = c instanceof HashAccess ? c : new HashSet<>(c);
      for (Iterator<?> i = iterator(); i.hasNext(); ) {
          if (cc.contains(i.next())) {
              i.remove();
              modified = true;
          }
      }
  }

This way, the contains invocation in `if (cc.contains(i.next())) {` would
have been optimized.

Thanks,
fps.-