Re: JDK-6982173: Small problem causing thousands of wasted CPU hours
> I think the original idea was that the "general contract" was that > collections contained elements whose membership was defined by equals(). > However, for a very long time now there have been collections in the JDK such > as SortedSet and IdentityHashMap (more precisely, IdentityHashMap's keyset). > I think we have to accept the fact that these are collections even though > their membership semantics differ from those provided by equals(). Perhaps we must accept their placement in the type hierarchy for compatibility reasons, but it seems reasonable to me to declare these exceptional classes to be nonstandard, which means that their behavior when used as a Collection might be unexpected. An alternative, perhaps, would be to add methods that describe the nature of the membership, e.g. whether based on equals/hash or a comparator (or something else?), which would then become part of the general contract. You’ll probably call this a non-starter :-) > After these are fixed, the specification and implementation will at least be > self-consistent. However, it will still be the case that HashSet, TreeSet, > and IdentityHashMap's keyset will have different membership semantics, and if > you pass them around interchangeably you might get unexpected results. I > think we just have to live with that. I’m not sure what you mean by interchangeably. A TreeSet could misbehave when passed to a TreeSet if the method assumes the “general contract” is satisfied by its parameter. Or are you adding a special-case requirement that every method on a Collection that takes a Collection parameter must behave properly when the actual argument is an instance of the same class, or something like that? > Once we've admitted to the fact that different collections can have different > membership semantics, the main point of this bug (JDK-6982173) is to fix up > the semantics of AbstractSet.removeAll() to remove the semantic shifting that > can occur based on the relative sizes of the collections. If we declare that collections can have different membership semantics and still satisfy the general contract of Collection, then we need to come up with a new specification of the general contract. However, if we specify that membership based on anything other than equals/hash is nonstandard, then this “bug” is not a bug. It is a valid implementation on the assumption that the parameter satisfies the general contract of the parameter type. The behavior that you call a semantic shift, I would call an unexpected behavior resulting from using a nonstandard collection where a Collection was expected. I also believe that the performace penalty of fixing this bug is “astonishing”. One could argue whether the performance penalty is more or less astonishing than the unexpected behavior when a non-standard collection is used as the parameter. > Introducing a new interface type is basically a non-starter. Hopefully, what you mean is that introducting a new interface type and incompatibly rearranging the type hierarchy is a non-starter. I still think it would be useful to: 1. introduce a Membership interface 2. have Collection extend Membership 3. add a method to Collection: removeAllMembers(Membership) > Objects of type Collection get passed around a lot, and users can usefully > rely on things like size(), they can iterate the Collection, etc. Except when they are implementing something like removeAll()? They can iterate but not assume that they are getting all the objects that are members of the collection, but instead they are getting representative objects? Alan > On Feb 25, 2019, at 11:06 AM, Stuart Marks wrote: > > > > On 2/15/19 11:30 AM, Alan Snyder wrote: >> I think this situation is a mess. >> The “general contract” of a Collection, I believe, is that it contains zero >> or more identified member objects, as defined by appropriate equals() method. >> I know this is hard to specify precisely, but I presume we all “know it when >> we see it”. >> There is value to “collections” whose members are not objects but are >> equivalence classes of objects, as defined by a nonstandard equality test, >> but I think it is a mistake to call them Collections. >> If a method takes a parameter of type Collection, it should be free to assume >> that the parameter object supports the “general contract” of Collection. Is >> there any plausible alternative? > > [back from vacation] > > Yes, this situation is something of a mess. > > I think the original idea was that the "general contract" was that > collections contained elements whose membership was defined by equals(). > However, for a very long time now there have been collections in the JDK such > as SortedSet and IdentityHashMap (more precisely, IdentityHashMap's keyset). > I think we have to accept the fact that these are collections even though > their membership semantics differ from those provided by equals(). > >> Changing one
Re: JDK-6982173: Small problem causing thousands of wasted CPU hours
On 2/15/19 11:30 AM, Alan Snyder wrote: I think this situation is a mess. The “general contract” of a Collection, I believe, is that it contains zero or more identified member objects, as defined by appropriate equals() method. I know this is hard to specify precisely, but I presume we all “know it when we see it”. There is value to “collections” whose members are not objects but are equivalence classes of objects, as defined by a nonstandard equality test, but I think it is a mistake to call them Collections. If a method takes a parameter of type Collection, it should be free to assume that the parameter object supports the “general contract” of Collection. Is there any plausible alternative? [back from vacation] Yes, this situation is something of a mess. I think the original idea was that the "general contract" was that collections contained elements whose membership was defined by equals(). However, for a very long time now there have been collections in the JDK such as SortedSet and IdentityHashMap (more precisely, IdentityHashMap's keyset). I think we have to accept the fact that these are collections even though their membership semantics differ from those provided by equals(). Changing one method in the JDK to support a non-standard Collection parameter does not solve the problem, because non-JDK methods with Collection (etc.) parameters could have similar “misbehavior”. How would the developer know when a specific TreeSet can or cannot be passed to a method? Does every method that accepts a Collection (etc.) parameter require boilerplate (as in the disjoint example) explaining its exact requirements or how it can go wrong? Perhaps it would be useful to define specific behaviors that these nonstandard “collections” might support. For example, a Membership interface (with method contains(e)) would be perfect for a removeAll(Membership) method on Collection, implemented as you propose. The wording on Collections.disjoint() is particularly clumsy. I'm not entirely sure what to do about it and its implementation; suggestions welcome. But I also don't think Collections.disjoint() is a very important problem, compared to what's going on in the rest of the framework. Introducing a new interface type is basically a non-starter. Objects of type Collection get passed around a lot, and users can usefully rely on things like size(), they can iterate the Collection, etc. Once we've admitted to the fact that different collections can have different membership semantics, the main point of this bug (JDK-6982173) is to fix up the semantics of AbstractSet.removeAll() to remove the semantic shifting that can occur based on the relative sizes of the collections. There's another bug JDK-8190545 that needs to be fixed at some point, which is a larger scale cleanup to clarify and remove assumptions about collection membership. It starts out from the point that SortedSet and its ilk define membership using a Comparator; yet TreeSet.contains() still talks about using equals(), which is simply incorrect. Pulling on that thread for a while reveals a bunch of places in the spec that need adjustment. See the comments in the bug report and the linked email discussions. After these are fixed, the specification and implementation will at least be self-consistent. However, it will still be the case that HashSet, TreeSet, and IdentityHashMap's keyset will have different membership semantics, and if you pass them around interchangeably you might get unexpected results. I think we just have to live with that. s'marks
Re: JDK-6982173: Small problem causing thousands of wasted CPU hours
I think this situation is a mess. The “general contract” of a Collection, I believe, is that it contains zero or more identified member objects, as defined by appropriate equals() method. I know this is hard to specify precisely, but I presume we all “know it when we see it”. There is value to “collections” whose members are not objects but are equivalence classes of objects, as defined by a nonstandard equality test, but I think it is a mistake to call them Collections. If a method takes a parameter of type Collection, it should be free to assume that the parameter object supports the “general contract” of Collection. Is there any plausible alternative? Changing one method in the JDK to support a non-standard Collection parameter does not solve the problem, because non-JDK methods with Collection (etc.) parameters could have similar “misbehavior”. How would the developer know when a specific TreeSet can or cannot be passed to a method? Does every method that accepts a Collection (etc.) parameter require boilerplate (as in the disjoint example) explaining its exact requirements or how it can go wrong? Perhaps it would be useful to define specific behaviors that these nonstandard “collections” might support. For example, a Membership interface (with method contains(e)) would be perfect for a removeAll(Membership) method on Collection, implemented as you propose. Alan > On Feb 15, 2019, at 10:44 AM, Stuart Marks wrote: > > > > On 2/14/19 9:30 AM, Alan Snyder wrote: >>> Care must be exercised if this method is used on collections that >>> do not comply with the general contract for {@code Collection}. >> So, what does this mean? Are we catering to incorrect implementations? > > I think this is a quote from the specification of Collections.disjoint(): > >> * Care must be exercised if this method is used on collections that >> * do not comply with the general contract for {@code Collection}. >> * Implementations may elect to iterate over either collection and test >> * for containment in the other collection (or to perform any equivalent >> * computation). If either collection uses a nonstandard equality test >> * (as does a {@link SortedSet} whose ordering is not compatible with >> * equals, or the key set of an {@link IdentityHashMap}), both >> * collections must use the same nonstandard equality test, or the >> * result of this method is undefined. > > (Collections.disjoint() has similar issues to removeAll/retainAll but I don't > think it's as important to fix, nor do I think it necessarily must be made > consistent with them. But that's kind of a separate discussion.) > > I dislike the phrase "general contract" because it's, well, too general. I > think it mainly refers to what I've been referring to as "contains() > semantics" or "membership semantics". The rest of the paragraph refers to > cases where a "nonstandard equality test" is used. I prefer to say that the > membership semantics of things like IdentityHashMap and SortedSet differ from > the membership semantics specified in the Set interface. > > Specifically, Set says > >sets contain no pair of elements e1 and e2 such that e1.equals(e2) > > whereas IdentityHashMap implies that it cannot contain any two keys k1 and k2 > such that k1 == k2, and SortedSet should say that it contains no pair of > elements e1 and e2 such that compare(e1, e2) == 0. > > (This last is the subject that JDK-8190545 is intended to address.) > > The upshot is that Collections.disjoint might not work as expected if it's > used on collections that have different membership semantics. > > s'marks >
Re: JDK-6982173: Small problem causing thousands of wasted CPU hours
On 2/14/19 9:30 AM, Alan Snyder wrote: Care must be exercised if this method is used on collections that do not comply with the general contract for {@code Collection}. So, what does this mean? Are we catering to incorrect implementations? I think this is a quote from the specification of Collections.disjoint(): * Care must be exercised if this method is used on collections that * do not comply with the general contract for {@code Collection}. * Implementations may elect to iterate over either collection and test * for containment in the other collection (or to perform any equivalent * computation). If either collection uses a nonstandard equality test * (as does a {@link SortedSet} whose ordering is not compatible with * equals, or the key set of an {@link IdentityHashMap}), both * collections must use the same nonstandard equality test, or the * result of this method is undefined. (Collections.disjoint() has similar issues to removeAll/retainAll but I don't think it's as important to fix, nor do I think it necessarily must be made consistent with them. But that's kind of a separate discussion.) I dislike the phrase "general contract" because it's, well, too general. I think it mainly refers to what I've been referring to as "contains() semantics" or "membership semantics". The rest of the paragraph refers to cases where a "nonstandard equality test" is used. I prefer to say that the membership semantics of things like IdentityHashMap and SortedSet differ from the membership semantics specified in the Set interface. Specifically, Set says sets contain no pair of elements e1 and e2 such that e1.equals(e2) whereas IdentityHashMap implies that it cannot contain any two keys k1 and k2 such that k1 == k2, and SortedSet should say that it contains no pair of elements e1 and e2 such that compare(e1, e2) == 0. (This last is the subject that JDK-8190545 is intended to address.) The upshot is that Collections.disjoint might not work as expected if it's used on collections that have different membership semantics. s'marks
Re: JDK-6982173: Small problem causing thousands of wasted CPU hours
On Wed, Feb 13, 2019 at 6:45 PM Stuart Marks wrote: > > If we all agree on this, I'll proceed with working up a changeset for > JDK-6394757.[3] It's time to fix this one. > Thanks for taking on the fixing of this unfixable problem. It's important to do lots of documentation/guidance work, esp. a release note. Doug and I will take care of anything that needs doing in jsr166. In the real world, if performance is important, one might want to sort both collections (or use collections that are naturally sorted), then do a two-finger walk through both collections, funneling the results into a target collection. Which is N*log(N) + M*log(M) + N + M With enough work, that might be library-fiable. Hm ... Imagine ConcurrentSkipListSet.removeAll(aNavigableSet) ... Then by two-finger through both collections, advancing using higher/ceiling, you should get something like O(min(N*log(N), M*log(M))) If the argument is not sorted, then it is likely to be worth the effort of copying it into a sorted set (TreeSet with same comparator?). Software is hard.
Re: JDK-6982173: Small problem causing thousands of wasted CPU hours
Right, I see that now. Care must be exercised if this method is used on collections that do not comply with the general contract for {@code Collection}. So, what does this mean? Are we catering to incorrect implementations? > On Feb 13, 2019, at 9:07 PM, Stuart Marks wrote: > > On 2/13/19 7:22 PM, Alan Snyder wrote: >> If we take this route, how about changing the parameter type to Iterable? > > Won't work. Where I've ended up is that we need to iterate over "this" > collection and, for each element, call contains() on the parameter. The > AbstractCollection.removeAll() implementation does something like this: > >removeAll(Collection c) { >for (Iterator it = iterator(); it.hasNext();) { >if (c.contains(it.next())) { >it.remove(); >} >} >} > > Since we call contains() on the parameter, it has to be a Collection. > > s'marks >
Re: JDK-6982173: Small problem causing thousands of wasted CPU hours
On 2/13/19 7:22 PM, Alan Snyder wrote: If we take this route, how about changing the parameter type to Iterable? Won't work. Where I've ended up is that we need to iterate over "this" collection and, for each element, call contains() on the parameter. The AbstractCollection.removeAll() implementation does something like this: removeAll(Collection c) { for (Iterator it = iterator(); it.hasNext();) { if (c.contains(it.next())) { it.remove(); } } } Since we call contains() on the parameter, it has to be a Collection. s'marks
Re: JDK-6982173: Small problem causing thousands of wasted CPU hours
If we take this route, how about changing the parameter type to Iterable? Alan > On Feb 13, 2019, at 7:12 PM, Stuart Marks wrote: > > Right, as I mentioned in my earlier emails [1][2] this is related to > JDK-6394757 [3] where the semantics shift depending on the relative sizes of > the collections. This has a complicated history. In JDK-6982173 [4] there is > a lot of discussion about what heuristics to use for iterating this vs the > arg, in order to get the best performance. This assumes that these operations > are equivalent, when they aren't! In JDK-6394757 there's some discussion > about the dubious semantics of this optimization and the potential of simply > removing it. > > At this point I think we need to remove the performance heurstic in order to > get consistent semantics. The performance won't be optimal -- indeed, it > wasn't before -- but at least it should be more predictable. > > Your example is indeed counterintuitive, as it uses the contains() semantics > of the argument -- the string length -- instead of the contains() semantics > of "this". However, it's defensible, and it's supported by the spec, so I > don't really see any alternative. > > The example set is pretty counterintuitive in the first place. For example: > > jshell> Set set = new > TreeSet<>(Comparator.comparingInt(String::length)) > jshell> set.addAll(List.of("a", "if", "the", "when")) > jshell> set > set ==> [a, if, the, when] > jshell> set.contains("foo") > $235 ==> true > > Also, this comparator isn't consistent with equals. This isn't a bug. But it > does give rise to even more counterintuitive behavior: > > jshell> var set2 = Set.of("x", "xx", "xxx", "") > jshell> set2.equals(set) > $237 ==> false > jshell> set.equals(set2) > $238 ==> true > > Fun with collections! :-) > > s'marks > > > [1] > http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-January/058172.html > > [2] > http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-February/058539.html > > [3] https://bugs.openjdk.java.net/browse/JDK-6394757 > > [4] https://bugs.openjdk.java.net/browse/JDK-6982173 > > > On 2/11/19 2:23 PM, Michael Rasmussen wrote: >> The current implementation seems very counter-intuitive with anything that >> doesn't have the same comparison semantics, for instance a TreeSet/Map with >> a Comparator, Identity-based Set/Map etc. >> The output of the following snippet would probably surprise most: >> /* --- snip --- */ >> import java.util.*; >> class Test { >> public static void main(String[] args) { >> TreeMap> lengthMap = new >> TreeMap<>(Comparator.comparingInt(String::length)); >> List strings = new ArrayList<>(Arrays.asList("The quick brown >> fox jumps over the lazy dog".split(" "))); >> for (String s : strings) { >> lengthMap.computeIfAbsent(s, k -> new ArrayList<>()).add(s); >> } >> Set toRemove = lengthMap.keySet(); >> System.out.println("List before: " + strings); >> System.out.println("Set to remove: " + toRemove); >> strings.removeAll(toRemove); >> System.out.println("List after:" + strings); >> } >> } >> /* --- snip --- */ >> List before: [The, quick, brown, fox, jumps, over, the, lazy, dog] >> Set to remove: [The, over, quick] >> List after:[] >> /Michael >> From: core-libs-dev on behalf of >> Tagir Valeev >> Sent: 08 February 2019 15:13 >> To: Stuart Marks >> Cc: core-libs-dev; Jan Peter Stotz >> Subject: Re: JDK-6982173: Small problem causing thousands of wasted CPU hours >> Hello! >>> I would argue that iterating the argument and calling remove() on "this" >>> are the >>> right semantics, because you want set membership to be determined by this >>> set, >>> not by whatever collection you pass as an argument. However, I note that >>> AbstractCollection.removeAll and other removeAll implementations iterate >>> over >>> "this" and do a contains() check on the argument. The revised >>> AbstractSet.removeAll would be an outlier here, though it makes sense to me >>> to >>> do it this way. >> For complete picture it should be noted that there's a slight >> difference in remove and removeAll spec: remove removes at most one >> element while removeAll removes all elements from the specified >> co
Re: JDK-6982173: Small problem causing thousands of wasted CPU hours
Right, as I mentioned in my earlier emails [1][2] this is related to JDK-6394757 [3] where the semantics shift depending on the relative sizes of the collections. This has a complicated history. In JDK-6982173 [4] there is a lot of discussion about what heuristics to use for iterating this vs the arg, in order to get the best performance. This assumes that these operations are equivalent, when they aren't! In JDK-6394757 there's some discussion about the dubious semantics of this optimization and the potential of simply removing it. At this point I think we need to remove the performance heurstic in order to get consistent semantics. The performance won't be optimal -- indeed, it wasn't before -- but at least it should be more predictable. Your example is indeed counterintuitive, as it uses the contains() semantics of the argument -- the string length -- instead of the contains() semantics of "this". However, it's defensible, and it's supported by the spec, so I don't really see any alternative. The example set is pretty counterintuitive in the first place. For example: jshell> Set set = new TreeSet<>(Comparator.comparingInt(String::length)) jshell> set.addAll(List.of("a", "if", "the", "when")) jshell> set set ==> [a, if, the, when] jshell> set.contains("foo") $235 ==> true Also, this comparator isn't consistent with equals. This isn't a bug. But it does give rise to even more counterintuitive behavior: jshell> var set2 = Set.of("x", "xx", "xxx", "") jshell> set2.equals(set) $237 ==> false jshell> set.equals(set2) $238 ==> true Fun with collections! :-) s'marks [1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-January/058172.html [2] http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-February/058539.html [3] https://bugs.openjdk.java.net/browse/JDK-6394757 [4] https://bugs.openjdk.java.net/browse/JDK-6982173 On 2/11/19 2:23 PM, Michael Rasmussen wrote: The current implementation seems very counter-intuitive with anything that doesn't have the same comparison semantics, for instance a TreeSet/Map with a Comparator, Identity-based Set/Map etc. The output of the following snippet would probably surprise most: /* --- snip --- */ import java.util.*; class Test { public static void main(String[] args) { TreeMap> lengthMap = new TreeMap<>(Comparator.comparingInt(String::length)); List strings = new ArrayList<>(Arrays.asList("The quick brown fox jumps over the lazy dog".split(" "))); for (String s : strings) { lengthMap.computeIfAbsent(s, k -> new ArrayList<>()).add(s); } Set toRemove = lengthMap.keySet(); System.out.println("List before: " + strings); System.out.println("Set to remove: " + toRemove); strings.removeAll(toRemove); System.out.println("List after:" + strings); } } /* --- snip --- */ List before: [The, quick, brown, fox, jumps, over, the, lazy, dog] Set to remove: [The, over, quick] List after:[] /Michael From: core-libs-dev on behalf of Tagir Valeev Sent: 08 February 2019 15:13 To: Stuart Marks Cc: core-libs-dev; Jan Peter Stotz Subject: Re: JDK-6982173: Small problem causing thousands of wasted CPU hours Hello! I would argue that iterating the argument and calling remove() on "this" are the right semantics, because you want set membership to be determined by this set, not by whatever collection you pass as an argument. However, I note that AbstractCollection.removeAll and other removeAll implementations iterate over "this" and do a contains() check on the argument. The revised AbstractSet.removeAll would be an outlier here, though it makes sense to me to do it this way. For complete picture it should be noted that there's a slight difference in remove and removeAll spec: remove removes at most one element while removeAll removes all elements from the specified collection. E.g. c.removeAll(Collections.singleton(foo)) would remove all instances of foo from c while c.remove(foo) would return only one foo. These should be equivalent for Set where repeating elements should not normally appear, but it's wrong for any Collection. That's why AbstractCollection.removeAll cannot be rewritten in the same way. Another interesting thing is java.util.IdentityHashMap.KeySet#removeAll implementation [1]: it reimplements the AbstractCollection#removeAll, because of the same reason: now the first branch of AbstractSet#removeAll becomes incorrect. See the comment before it: /* * Must revert from AbstractSet's impl to AbstractCollection's, as * the former contains an optimization that results in incorrect * behavior when c is a smaller "normal" (non-identity-based) Set. */ Btw thi
Re: JDK-6982173: Small problem causing thousands of wasted CPU hours
On 2/8/19 5:13 AM, Tagir Valeev wrote: I would argue that iterating the argument and calling remove() on "this" are the right semantics, because you want set membership to be determined by this set, not by whatever collection you pass as an argument. However, I note that AbstractCollection.removeAll and other removeAll implementations iterate over "this" and do a contains() check on the argument. The revised AbstractSet.removeAll would be an outlier here, though it makes sense to me to do it this way. For complete picture it should be noted that there's a slight difference in remove and removeAll spec: remove removes at most one element while removeAll removes all elements from the specified collection. E.g. c.removeAll(Collections.singleton(foo)) would remove all instances of foo from c while c.remove(foo) would return only one foo. These should be equivalent for Set where repeating elements should not normally appear, but it's wrong for any Collection. That's why AbstractCollection.removeAll cannot be rewritten in the same way. Yes, this is a good point. This increases my concern over my earlier proposal for AbstractSet.removeAll being an "outlier". The main issue here is whose semantics for containership are used. I had proposed iterating "arg" and calling this.remove() because that makes sense in isolation. However, AbstractCollection iterates "this" and calls arg.contains(). As you note, the specification implies that it must be done this way. This conflicts with my AbstractSet proposal. In addition, there is the matter of retainAll(). You can't iterate "arg" and call this.remove(), since you need to remove elements that *aren't* present in "arg". At least, you can't without building an intermediate data structure, which seems out of bounds here. So retainAll() pretty much needs to iterate "this" and call arg.contains(). Based on this, I've changed my mind. AbstractSet.removeAll() really should do the same thing as AbstractCollection.removeAll(), namely to iterate "this" and call arg.contains(). In fact, AbstractSet can simply inherit the AbstractCollection implementation, and its specification adjusted accordingly. Unfortunately, this means that code like the following: someSet.removeAll(someList) will have O(m*n) performance, since it will repeatedly call someList.contains(). This doesn't alleviate the performance concern. However, it's at least specified and predictable, so one can work around it if necessary. Also, set1.removeAll(set2) will unconditionally use set2's semantics for containership. This is a bit counterintuitive, as one might expect set1's semantics to be used. But it needs to be this way to be consistent with Collection.removeAll(), Collection.retainAll(), and Set.retainAll(). And at least it's predictable, and it doesn't vary based on the relative sizes of the sets! Another interesting thing is java.util.IdentityHashMap.KeySet#removeAll implementation [1]: it reimplements the AbstractCollection#removeAll, because of the same reason: now the first branch of AbstractSet#removeAll becomes incorrect. See the comment before it: /* * Must revert from AbstractSet's impl to AbstractCollection's, as * the former contains an optimization that results in incorrect * behavior when c is a smaller "normal" (non-identity-based) Set. */ Btw this comment should be updated to remove a "smaller" condition if the proposed change for AbstractSet will be implemented. [1] http://hg.openjdk.java.net/jdk/jdk/file/e57bcfd7bf79/src/java.base/share/classes/java/util/IdentityHashMap.java#l990 Also a good point. In fact, with my current thinking, this comment and the override can be removed, since it can be inherited from AbstractCollection. (This also applies to EntrySet later in the same file.) ** I also note that ConcurrentHashMap's CollectionView nested class [1] has a removeAll() implementation that uses a similar (but slightly different) heuristic for deciding whether to iterate "this" or "arg". I think this heuristic should be removed, since it suffers from the same semantic shift depending on relative sizes. I further note that ConcurrentSkipListSet.removeAll() overrides AbstractSet.removeAll() in order to avoid an expensive call to size(). But it always iterates its arg and removes from this, which is backwards from everywhere else. (But at least it doesn't do any size-based "optimization".) I think this is simply a bug. I've filed JDK-8218945 [2] to cover this. ** If we all agree on this, I'll proceed with working up a changeset for JDK-6394757.[3] It's time to fix this one. s'marks [1] http://hg.openjdk.java.net/jdk/jdk11/file/1ddf9a99e4ad/src/java.base/share/classes/java/util/concurrent/ConcurrentHashMap.java#l4538 [2] https://bugs.openjdk.java.net/browse/JDK-8218945 [3] https://bugs.openjdk.java.net/browse/JDK-6394757
Re: JDK-6982173: Small problem causing thousands of wasted CPU hours
The current implementation seems very counter-intuitive with anything that doesn't have the same comparison semantics, for instance a TreeSet/Map with a Comparator, Identity-based Set/Map etc. The output of the following snippet would probably surprise most: /* --- snip --- */ import java.util.*; class Test { public static void main(String[] args) { TreeMap> lengthMap = new TreeMap<>(Comparator.comparingInt(String::length)); List strings = new ArrayList<>(Arrays.asList("The quick brown fox jumps over the lazy dog".split(" "))); for (String s : strings) { lengthMap.computeIfAbsent(s, k -> new ArrayList<>()).add(s); } Set toRemove = lengthMap.keySet(); System.out.println("List before: " + strings); System.out.println("Set to remove: " + toRemove); strings.removeAll(toRemove); System.out.println("List after:" + strings); } } /* --- snip --- */ List before: [The, quick, brown, fox, jumps, over, the, lazy, dog] Set to remove: [The, over, quick] List after:[] /Michael From: core-libs-dev on behalf of Tagir Valeev Sent: 08 February 2019 15:13 To: Stuart Marks Cc: core-libs-dev; Jan Peter Stotz Subject: Re: JDK-6982173: Small problem causing thousands of wasted CPU hours Hello! > I would argue that iterating the argument and calling remove() on "this" are > the > right semantics, because you want set membership to be determined by this set, > not by whatever collection you pass as an argument. However, I note that > AbstractCollection.removeAll and other removeAll implementations iterate over > "this" and do a contains() check on the argument. The revised > AbstractSet.removeAll would be an outlier here, though it makes sense to me to > do it this way. For complete picture it should be noted that there's a slight difference in remove and removeAll spec: remove removes at most one element while removeAll removes all elements from the specified collection. E.g. c.removeAll(Collections.singleton(foo)) would remove all instances of foo from c while c.remove(foo) would return only one foo. These should be equivalent for Set where repeating elements should not normally appear, but it's wrong for any Collection. That's why AbstractCollection.removeAll cannot be rewritten in the same way. Another interesting thing is java.util.IdentityHashMap.KeySet#removeAll implementation [1]: it reimplements the AbstractCollection#removeAll, because of the same reason: now the first branch of AbstractSet#removeAll becomes incorrect. See the comment before it: /* * Must revert from AbstractSet's impl to AbstractCollection's, as * the former contains an optimization that results in incorrect * behavior when c is a smaller "normal" (non-identity-based) Set. */ Btw this comment should be updated to remove a "smaller" condition if the proposed change for AbstractSet will be implemented. [1] http://hg.openjdk.java.net/jdk/jdk/file/e57bcfd7bf79/src/java.base/share/classes/java/util/IdentityHashMap.java#l990 With best regards, Tagir Valeev On Fri, Jan 25, 2019 at 7:11 AM Stuart Marks wrote: > > On 1/23/19 12:05 PM, Jan Peter Stotz wrote: > > like many other I ran into bug JDK-698217 which is about > > AbstractSet.removeAll() > > and it's "aptitude" in wasting CPU time if you accidentally hit this bug. > > And > > there are hundred of developers hitting this bug every year. > > > > I simply don't understand why there was no progress in 8 years, on a severe > > performance issue (a removeAll method on an efficient set that can require > > O(n^2)!) caused by code that was written to speed-up the removeAll > > implementation. > > > > Which makes this bug worse is that it is triggered by the relative size of > > the > > current set compared to the collection passed as parameter. > > Therefore for most developers this means not to use this buggy function at > > all > > (once they realized how worse it is). > > I wasn't aware that hundreds of developers are hitting this bug every year. I > haven't seen any mention of it (besides in the bug database) on Twitter, on > Reddit, on DZone, at the conferences I attend, or in several years of > core-libs-dev emails. Well, it was mentioned on core-libs-dev once in 2011 [1] > although that was a suggestion for improvement, not a complaint about > performance. > > Nonetheless this is a real problem, and if it's causing difficulties we can > certainly take a look at it. > > There is, however, more to the story. The ACTUAL problem is a semantic one; > see > JDK-6394757. [2] Briefly, consider x.removeAll(y). Depending on the relative > sizes of x and y, this method might end up
Re: JDK-6982173: Small problem causing thousands of wasted CPU hours
Hello! > I would argue that iterating the argument and calling remove() on "this" are > the > right semantics, because you want set membership to be determined by this set, > not by whatever collection you pass as an argument. However, I note that > AbstractCollection.removeAll and other removeAll implementations iterate over > "this" and do a contains() check on the argument. The revised > AbstractSet.removeAll would be an outlier here, though it makes sense to me to > do it this way. For complete picture it should be noted that there's a slight difference in remove and removeAll spec: remove removes at most one element while removeAll removes all elements from the specified collection. E.g. c.removeAll(Collections.singleton(foo)) would remove all instances of foo from c while c.remove(foo) would return only one foo. These should be equivalent for Set where repeating elements should not normally appear, but it's wrong for any Collection. That's why AbstractCollection.removeAll cannot be rewritten in the same way. Another interesting thing is java.util.IdentityHashMap.KeySet#removeAll implementation [1]: it reimplements the AbstractCollection#removeAll, because of the same reason: now the first branch of AbstractSet#removeAll becomes incorrect. See the comment before it: /* * Must revert from AbstractSet's impl to AbstractCollection's, as * the former contains an optimization that results in incorrect * behavior when c is a smaller "normal" (non-identity-based) Set. */ Btw this comment should be updated to remove a "smaller" condition if the proposed change for AbstractSet will be implemented. [1] http://hg.openjdk.java.net/jdk/jdk/file/e57bcfd7bf79/src/java.base/share/classes/java/util/IdentityHashMap.java#l990 With best regards, Tagir Valeev On Fri, Jan 25, 2019 at 7:11 AM Stuart Marks wrote: > > On 1/23/19 12:05 PM, Jan Peter Stotz wrote: > > like many other I ran into bug JDK-698217 which is about > > AbstractSet.removeAll() > > and it's "aptitude" in wasting CPU time if you accidentally hit this bug. > > And > > there are hundred of developers hitting this bug every year. > > > > I simply don't understand why there was no progress in 8 years, on a severe > > performance issue (a removeAll method on an efficient set that can require > > O(n^2)!) caused by code that was written to speed-up the removeAll > > implementation. > > > > Which makes this bug worse is that it is triggered by the relative size of > > the > > current set compared to the collection passed as parameter. > > Therefore for most developers this means not to use this buggy function at > > all > > (once they realized how worse it is). > > I wasn't aware that hundreds of developers are hitting this bug every year. I > haven't seen any mention of it (besides in the bug database) on Twitter, on > Reddit, on DZone, at the conferences I attend, or in several years of > core-libs-dev emails. Well, it was mentioned on core-libs-dev once in 2011 [1] > although that was a suggestion for improvement, not a complaint about > performance. > > Nonetheless this is a real problem, and if it's causing difficulties we can > certainly take a look at it. > > There is, however, more to the story. The ACTUAL problem is a semantic one; > see > JDK-6394757. [2] Briefly, consider x.removeAll(y). Depending on the relative > sizes of x and y, this method might end up using either x's or y's definition > of > membership, which could differ from each other. (See the bug report for an > example.) Thus the semantics of this method depend upon the relative sizes of > the collections, which is arguably flawed. > > Worse, this behavior is specified to iterate this set or the argument, > depending > upon their relative sizes. [3] So, fixing this will require an incompatible > specification change. > > The obvious way to fix this is to get rid of the "optimizations" (that turn > out > not to be optimizations at all in some cases) and replace it with a simple > loop: > > public boolean removeAll(Collection c) { > Objects.requireNonNull(c); > boolean modified = false; > for (Object e : c) > modified |= remove(e); > return modified; > } > > I would argue that iterating the argument and calling remove() on "this" are > the > right semantics, because you want set membership to be determined by this set, > not by whatever collection you pass as an argument. However, I note that > AbstractCollection.removeAll and other removeAll implementations iterate over > "this" and do a contains() check on the argument. The revised > AbstractSet.removeAll would be an outlier here, though it makes sense to me to > do it this way. > > Is it worth the incompatibility? > > s'marks > > > > > [1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-July/007125.html > > [2] https://bugs.openjdk.java.net/browse/JDK-6394757 > > [3] >
Re: JDK-6982173: Small problem causing thousands of wasted CPU hours
Hi Stuart. It looks like the web sites and services I use when developing and those you use are fundamentally different or may be we use just different search engines, because when I have programming problems I usually not end up on Twitter or Reddit or DZone. And the existence of this mailing list seems to be hidden to the net or it's SE ranking is so bad that I can't remember that I ever found a link to it in the last years (before knowing it's existence and explicitly searching for it). However when I search most likely the majority of links point to Stackoverflow.com. Searching Stackoverflow for something like "[Java] [performance] Hashset" you immediatly get to the question https://stackoverflow.com/questions/28671903 which has 65 upvotes and a view count of 8,085. For a 5 years old question this is IMHO pretty impressive. Also one have to keep in mind that we are talking about a performance problem that requires usually an experienced developer and large data sets before the delay becomes so apparently that you start searching for the reason. Your proposed replacement for AbstractSet.removeAll() looks nice and I agree with you regarding the arguments you pointed out on what collection used as "this" and what as argument. That should be clearly a decision of the developer. That the change makes AbstractSet.removeAll "an outlier" is in my opinion only a problem from a very high-level perspective. Only people not aware of the differences between all the Collection implementations and the reason why they exist may expect to have a removeAll() implementation that works identical for every Collection implementation. However regular programmers from my experience rarely come to a point where they use Collection. Personally for me Collection is more or less a synonym for Iterable because iterating over a Collection is one of the few properties that work without major (potential) performance issues for all implementations. Jan Am 25.01.2019 um 01:07 schrieb Stuart Marks: I wasn't aware that hundreds of developers are hitting this bug every year. I haven't seen any mention of it (besides in the bug database) on Twitter, on Reddit, on DZone, at the conferences I attend, or in several years of core-libs-dev emails. Well, it was mentioned on core-libs-dev once in 2011 [1] although that was a suggestion for improvement, not a complaint about performance. Nonetheless this is a real problem, and if it's causing difficulties we can certainly take a look at it. There is, however, more to the story. The ACTUAL problem is a semantic one; see JDK-6394757. [2] Briefly, consider x.removeAll(y). Depending on the relative sizes of x and y, this method might end up using either x's or y's definition of membership, which could differ from each other. (See the bug report for an example.) Thus the semantics of this method depend upon the relative sizes of the collections, which is arguably flawed. Worse, this behavior is specified to iterate this set or the argument, depending upon their relative sizes. [3] So, fixing this will require an incompatible specification change. The obvious way to fix this is to get rid of the "optimizations" (that turn out not to be optimizations at all in some cases) and replace it with a simple loop: public boolean removeAll(Collection c) { Objects.requireNonNull(c); boolean modified = false; for (Object e : c) modified |= remove(e); return modified; } I would argue that iterating the argument and calling remove() on "this" are the right semantics, because you want set membership to be determined by this set, not by whatever collection you pass as an argument. However, I note that AbstractCollection.removeAll and other removeAll implementations iterate over "this" and do a contains() check on the argument. The revised AbstractSet.removeAll would be an outlier here, though it makes sense to me to do it this way. Is it worth the incompatibility? s'marks [1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-July/007125.html [2] https://bugs.openjdk.java.net/browse/JDK-6394757 [3] https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/AbstractSet.html#removeAll(java.util.Collection)
Re: JDK-6982173: Small problem causing thousands of wasted CPU hours
On 1/23/19 12:05 PM, Jan Peter Stotz wrote: like many other I ran into bug JDK-698217 which is about AbstractSet.removeAll() and it's "aptitude" in wasting CPU time if you accidentally hit this bug. And there are hundred of developers hitting this bug every year. I simply don't understand why there was no progress in 8 years, on a severe performance issue (a removeAll method on an efficient set that can require O(n^2)!) caused by code that was written to speed-up the removeAll implementation. Which makes this bug worse is that it is triggered by the relative size of the current set compared to the collection passed as parameter. Therefore for most developers this means not to use this buggy function at all (once they realized how worse it is). I wasn't aware that hundreds of developers are hitting this bug every year. I haven't seen any mention of it (besides in the bug database) on Twitter, on Reddit, on DZone, at the conferences I attend, or in several years of core-libs-dev emails. Well, it was mentioned on core-libs-dev once in 2011 [1] although that was a suggestion for improvement, not a complaint about performance. Nonetheless this is a real problem, and if it's causing difficulties we can certainly take a look at it. There is, however, more to the story. The ACTUAL problem is a semantic one; see JDK-6394757. [2] Briefly, consider x.removeAll(y). Depending on the relative sizes of x and y, this method might end up using either x's or y's definition of membership, which could differ from each other. (See the bug report for an example.) Thus the semantics of this method depend upon the relative sizes of the collections, which is arguably flawed. Worse, this behavior is specified to iterate this set or the argument, depending upon their relative sizes. [3] So, fixing this will require an incompatible specification change. The obvious way to fix this is to get rid of the "optimizations" (that turn out not to be optimizations at all in some cases) and replace it with a simple loop: public boolean removeAll(Collection c) { Objects.requireNonNull(c); boolean modified = false; for (Object e : c) modified |= remove(e); return modified; } I would argue that iterating the argument and calling remove() on "this" are the right semantics, because you want set membership to be determined by this set, not by whatever collection you pass as an argument. However, I note that AbstractCollection.removeAll and other removeAll implementations iterate over "this" and do a contains() check on the argument. The revised AbstractSet.removeAll would be an outlier here, though it makes sense to me to do it this way. Is it worth the incompatibility? s'marks [1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-July/007125.html [2] https://bugs.openjdk.java.net/browse/JDK-6394757 [3] https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/AbstractSet.html#removeAll(java.util.Collection)
Re: JDK-6982173: Small problem causing thousands of wasted CPU hours
Hi. Unfortunately in my last message I made mistake and mixed up the two sections. The main reasons for were first too many coding hours yesterday and second I had investigated this issue not yesterday but some weeks ago. During the last weeks I searched a way to participate in OpenJDK except for some high-traffic mailing list (that is IMHO development as it was in the last millennium). Especially the useless bug tracker (as it is only a bug viewer) is really not understandable to me. Federico Peralta Schaffner wrote: Now, how have you come to the conclusion that the problem is in the first branch of the above if statement, and that it is related to the passed collection's implementation details of the remove(Object o) method? I think this is not the problem, as elements are removed from this set and not from the passed collection. You are right, the first section is not the problem but the second one is. I came to the same conclusion as you did when I first investigated this issue. However I would not call a marker interface HashAccess as it reflects a technology not an implementation characteristic. E.g. a TreeSet has an access time of O(log(n)) which may also be sufficiently fast enough for the second code part of AbstractSet.removeAll() but it's implementation does not base on hashing. Therefore my suggestion is still to simply check if the other collection is a Set and only if it is one use the second code part. Jan PS: If somebody answers please cc me personally as I only use the digest mode of this list.
Re: JDK-6982173: Small problem causing thousands of wasted CPU hours
Hi Jan, I'm amazed to see this performance problem has persisted for so long. As a Java developer, I wasn't even aware of it, so thanks for the information. > > The relevant part of the current implementation is as follows: > > if (size() > c.size()) { > for (Object e : c) > modified |= remove(e); > } else { > for (Iterator i = iterator(); i.hasNext(); ) { > if (c.contains(i.next())) { > i.remove(); > modified = true; > } > } > } > > The problem is that the code in the first if block only makes sense if > the collections passed as parameter implements the remove(Object o) > method in an efficient way - this means with a complexity of less than > O(n). Now, how have you come to the conclusion that the problem is in the first branch of the above if statement, and that it is related to the passed collection's implementation details of the remove(Object o) method? I think this is not the problem, as elements are removed from this set and not from the passed collection. In fact, the problem is in the else branch (when this set is smaller than the passed collection). The problem is in the following line: if (c.contains(i.next())) { That c.contains(...) might not be implemented in an optimized way, i.e. c might be an ArrayList, which uses a linear search to check if the element belongs to it. > Java currently has no way to identify Collection implementations that > satisfy this property (...) Yes, you are correct. While there exists the RandomAccess marker interface for List implementations, there is no analogous marker interface for neither Set nor Map implementations that use hashing, i.e. something that could have been named "HashAccess". So I wonder and if the members of this group find it plausible to add such HashAccess marker interface to HashSet and HashMap (and to any other implementations that use hashing). If (as I suspect) introducing further marker interfaces is not desirable (or has been tacitly forbidden), then what alternatives are available and would be a good fit here? For the record, if the HashAccess marker interface had ever existed, the implementation could have been changed to: if (size() > c.size()) { for (Object e : c) modified |= remove(e); } else { Collection cc = c instanceof HashAccess ? c : new HashSet<>(c); for (Iterator i = iterator(); i.hasNext(); ) { if (cc.contains(i.next())) { i.remove(); modified = true; } } } This way, the contains invocation in `if (cc.contains(i.next())) {` would have been optimized. Thanks, fps.-