### Re: [Haskell-cafe] ordNub

Niklas Hambüchen mail at nh2.me writes: In sets, the order does not matter, while for nub it does. Let's be careful here!. Niklas, when you say order, do you mean: * the _ordering_ from the Ord instance? Or * the relative sequence of elements in the list? ... the fact that Set is used inside my proposed ordNub implementation is a detail not visible to the caller. If you use the Set library, that fact may be very visible! Because Set re-sequences the whole list, as per its Ord instance. But List.nub preserves the list sequence (except for omitting duplicates). Furthermore, the Ord instance might compare two elements as EQ, even though their Eq instance says they're not equal. So a Set-based ordNub could end up returning: * not the same elements as List.nub * and/or not in the same list sequence I'd call that very much *visible* to the caller. That's why it looks like a Data.List function to me. [BTW I am still less than convinced that overall a Set-based ordNub is significantly more efficient. I suspect it depends on how big is your list.] ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

On 13/10/13 21:42, AntC wrote: Niklas Hambüchen mail at nh2.me writes: In sets, the order does not matter, while for nub it does. Let's be careful here!. Niklas, when you say order, do you mean: * the _ordering_ from the Ord instance? Or * the relative sequence of elements in the list? ... the fact that Set is used inside my proposed ordNub implementation is a detail not visible to the caller. If you use the Set library, that fact may be very visible! Because Set re-sequences the whole list, as per its Ord instance. But List.nub preserves the list sequence (except for omitting duplicates). I mean *exactly* what you say here. ordNub behaves has the same behaviour as nub, while (Set.toList . Set.fromList) doesn't. [BTW I am still less than convinced that overall a Set-based ordNub is significantly more efficient. I suspect it depends on how big is your list.] What do you mean? ordNub is clearly in a different complexity class, and the benchmarks that I provided show not only this, but also that ordNub is *always* faster than nub, even for singleton lists. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

Niklas Hambüchen mail at nh2.me writes: On 13/10/13 21:42, AntC wrote: ... If you use the Set library, that fact may be very visible! Because Set re-sequences the whole list, as per its Ord instance. But List.nub preserves the list sequence (except for omitting duplicates). I mean *exactly* what you say here. ordNub behaves has the same behaviour as nub, while (Set.toList . Set.fromList) doesn't. That's great, thank you. [BTW I am still less than convinced that overall a Set-based ordNub is significantly more efficient. I suspect it depends on how big is your list.] What do you mean? ordNub is clearly in a different complexity class, ... Yes, I'm not disputing that. ... and the benchmarks that I provided show not only this, but also that ordNub is *always* faster than nub, even for singleton lists. Thanks Niklas, I hadn't spotted those benchmarks back in July. I'm surprised at that result for singletons (and for very small numbers of elements which are in fact each different). Especially because List's `nub` uses `filter` == fold, which should be tail-recursive. It seems to me that for small numbers, the Set-based approach still requires comparing each element to each other. Plus there's the overhead for building the Set and inserting each element into it -- where `insert` again walks the Set to find the insertion point. Then here's a further possible optimisation, instead of making separate calls to `member` and `insert`: * Make a single call to insert' :: (Ord a) = a - Set a - (Bool, Set a) * The Bool returns True if already a member. * Else returns an updated Set in the snd, with the element inserted. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

On 14/10/13 03:20, AntC wrote: Thanks Niklas, I hadn't spotted those benchmarks back in July. No worries :) I'm surprised at that result for singletons (and for very small numbers of elements which are in fact each different). I think one of the main reasons for the performance difference is that a list node and a Set binary tree node have pretty much the same performance, with the difference that in http://hackage.haskell.org/package/containers-0.5.2.1/docs/src/Data-Set-Base.html data Set a = Bin {-# UNPACK #-} !Size !a !(Set a) !(Set a) | Tip there are strictness and unpack annotations, while for data [a] = [] | a : [a] -- pseudo syntax there are not. Good for us in this case, I guess. It seems to me that for small numbers, the Set-based approach still requires comparing each element to each other. This I don't understand. Then here's a further possible optimisation, instead of making separate calls to `member` and `insert`: This I understand again. Where do you get insert' from? containers doesn't seem to have it. Do you suggest adding it? Niklas ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

Niklas Hambüchen mail at nh2.me writes: On 14/10/13 03:20, AntC wrote: ... Then here's a further possible optimisation, instead of making separate calls to `member` and `insert`: This I understand again. Where do you get insert' from? containers doesn't seem to have it. Do you suggest adding it? err, well I didn't have any specific library in mind. More there's a kind of 'folk idiom' for managing data structures, (this applies more for imperative code/update-in-situ than functional) that if you know the next thing you're going to do after failing to find an element is insert it, you might as well get on with the insert there and then. (It's a higher-level analogue of a machine instruction decrement-and- branch-if-zero.) I'm looking at all the remarks about managing libraries and dependencies. Would it make sense to build a stand-alone version of Set purely to support ordNub? Then it needs only methods `empty` and `insertIfAbsent`. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

I would like to come back to the original question: How can ordNub be added to base? I guess we agree that Data.List is the right module for a function of type Ord a = [a] - [a], but this introduces * a cyclic dependency between Data.List and Data.Set * a base dependency on containers. What is the right way to go with that? Should ordNub be introduced as part of Data.Set, as Conrad suggested? It does not really have anything to do with Set, apart from being implemented with it. On 14/07/13 14:12, Roman Cheplyaka wrote: Something like that should definitely be included in Data.List. Thanks for working on it. Roman ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

On Oct 12, 2013, at 2:47 PM, Niklas Hambüchen m...@nh2.me wrote: I would like to come back to the original question: How can ordNub be added to base? I guess we agree that Data.List is the right module for a function of type Ord a = [a] - [a], but this introduces * a cyclic dependency between Data.List and Data.Set * a base dependency on containers. What is the right way to go with that? Should ordNub be introduced as part of Data.Set, as Conrad suggested? It does not really have anything to do with Set, apart from being implemented with it. I think nub's behavior is rather set-related, so I don't really understand the objection to putting it in Data.Set. Anthony On 14/07/13 14:12, Roman Cheplyaka wrote: Something like that should definitely be included in Data.List. Thanks for working on it. Roman ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

On 12/10/13 20:43, Anthony Cowley wrote: I think nub's behavior is rather set-related, so I don't really understand the objection to putting it in Data.Set. In sets, the order does not matter, while for nub it does. nub:: Eq a = [a] - [a] ordNub :: Ord a = [a] - [a] both do not mention Set, and the fact that Set is used inside my proposed ordNub implementation is a detail not visible to the caller. That's why it looks like a Data.List function to me. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

* Anthony Cowley acow...@seas.upenn.edu [2013-10-12 15:43:57-0400] On Oct 12, 2013, at 2:47 PM, Niklas Hambüchen m...@nh2.me wrote: I would like to come back to the original question: How can ordNub be added to base? I guess we agree that Data.List is the right module for a function of type Ord a = [a] - [a], but this introduces * a cyclic dependency between Data.List and Data.Set * a base dependency on containers. What is the right way to go with that? Should ordNub be introduced as part of Data.Set, as Conrad suggested? It does not really have anything to do with Set, apart from being implemented with it. I think nub's behavior is rather set-related, so I don't really understand the objection to putting it in Data.Set. It's not Set (in the data structure sense) related. It's list-related, because it clearly acts on lists. Therefore, it belongs to Data.List. Besides, we already have the precedent of the slow nub being in Data.List. Roman signature.asc Description: Digital signature ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

On 14/07/13 20:20, Niklas Hambüchen wrote: As you might not know, almost *all* practical Haskell projects use it, and that in places where an Ord instance is given, e.g. happy, Xmonad, ghc-mod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600 more (see https://github.com/nh2/haskell-ordnub). GHC uses nub. Also let me stress again that the n² case happens even if there are no duplicates. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

Richard A. O'Keefe ok at cs.otago.ac.nz writes: There are at least four different things that an Ord version might mean: - first sort a list, then eliminate duplicates - sort a list eliminating duplicates stably as you go (think 'merge sort', using 'union' instead of 'merge') - build a balanced tree set as you go - having a list that is already sorted, use that to eliminated duplicates cheaply. These things have different costs. For example, ... What I want is more often ordNubBy than ordNub, though. (ordNubBy you can get via a suitable Ord instance for the element type?) The bigger problem is that you might not have a suitable Ord instance. After all, sets are defined by equality/equivalence relation, not necessarily by Ord. There are many other things you might want to do with a set/collection than just remove duplicates. I notice that Data.Set.map = fromList . (map stuff) . toList That is, build two lists (to be GC'd), as well as the set result. So does the performane cost of from/to List outrun the benefit of Data.Set.union? Depends how much you're mapping vs inserting and checking membership. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

Francesco Mazzoli f...@mazzo.li writes: import qualified Data.HashSet as S nub :: Hashable a = [a] - [a] nub = S.toList . S.fromList Well, the above is not stable while Niklas’ is. But I guess that’s not the point of your message :). We could also implement Data.BloomFilter.nub, which removes equal elements probabilistically (with a small but non-zero chance of removing some unique elements) :-) -k -- If I haven't seen further, it is by standing in the footprints of giants ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

On 14.07.2013 13:20, Niklas Hambüchen wrote: I've taken the Ord-based O(n * log n) implementation from yi using a Set: ordNub :: (Ord a) = [a] - [a] ordNub l = go empty l where go _ [] = [] go s (x:xs) = if x `member` s then go s xs else x : go (insert x s) xs (The benchmark also shows some other potential problem: Using a state monad to keep the set instead of a function argument can be up to 20 times slower. Should that happen?) I cannot say whether this should happen, but your code about can be straight-forwardly refactored using a *Reader* monad. import Control.Monad.Reader import Data.Functor (($)) import qualified Data.Set as Set -- ifM still not in Control.Monad ifM mc md me = do { c - mc; if c then md else me } ordNub :: (Ord a) = [a] - [a] ordNub l = runReader (go l) Set.empty where go [] = return [] go (x:xs) = ifM ((x `Set.member`) $ ask) (go xs) ((x :) $ local (Set.insert x) (go xs)) test = ordNub [1,2,4,1,3,5,2,3,4,5,6,1] Of course, this does not lend itself to an application of filterM. In fact, your implementation is already in the (Set a -) reader monad, in normalized form. It looks already optimal to me. Cheers, Andreas -- Andreas AbelDu bist der geliebte Mensch. Theoretical Computer Science, University of Munich Oettingenstr. 67, D-80538 Munich, GERMANY andreas.a...@ifi.lmu.de http://www2.tcs.ifi.lmu.de/~abel/ ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

Apologies. I was being lazy. Here's a stable version: import qualified Data.HashSet as S hashNub :: (Ord a) = [a] - [a] hashNub l = go S.empty l where go _ [] = [] go s (x:xs) = if x `S.member` s then go s xs else x : go (S.insert x s) xs Which, again, will probably be faster than the one using Ord, and I can't think of any cases where I'd want the one using Ord instead. I may just not be creative enough, though. - Clark On Mon, Jul 15, 2013 at 12:46 AM, Brandon Allbery allber...@gmail.com wrote: On Sun, Jul 14, 2013 at 7:54 AM, Clark Gaebel cgae...@uwaterloo.ca wrote: Oops sorry I guess my point wasn't clear. Why ord based when hashable is faster? Then there's no reason this has to be in base, it can just be a Did the point about stable fly overhead? -- brandon s allbery kf8nh sine nomine associates allber...@gmail.com ballb...@sinenomine.net unix, openafs, kerberos, infrastructure, xmonadhttp://sinenomine.net ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

Hey Jason, would you mind giving a short idea of what the point of Bird's implementation is / from what properties it is derived? Also, running the QuickCheck tests you added, it doesn't give the same output (order) as nub. On 15/07/13 13:26, Jason Dagit wrote: Richard Bird has a book, Pearls of Functional Algorithm Design that is meant to teach a form of deriving algorithms from the properties we ask of them. In this book, he gives a possible derivation of ordNub, simply called nub in the book, following the methodology he is teaching. He notes in the text that this derivation feels more complicated than it ought. Here is his version: http://lpaste.net/87625 I just sent you a pull request to add that one and S.toList . S.fromList that was suggested in this thread. I don't think those two implementations are faster than the others but it's nice to have them for completeness. Jason ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

On Sun, Jul 14, 2013 at 7:54 AM, Clark Gaebel cgae...@uwaterloo.ca wrote: Oops sorry I guess my point wasn't clear. Why ord based when hashable is faster? Then there's no reason this has to be in base, it can just be a Did the point about stable fly overhead? -- brandon s allbery kf8nh sine nomine associates allber...@gmail.com ballb...@sinenomine.net unix, openafs, kerberos, infrastructure, xmonadhttp://sinenomine.net ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

On Sun, Jul 14, 2013 at 4:20 AM, Niklas Hambüchen m...@nh2.me wrote: tldr: nub is abnormally slow, we shouldn't use it, but we do. As you might know, Data.List.nub is O(n²). (*) As you might not know, almost *all* practical Haskell projects use it, and that in places where an Ord instance is given, e.g. happy, Xmonad, ghc-mod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600 more (see https://github.com/nh2/haskell-ordnub). I've taken the Ord-based O(n * log n) implementation from yi using a Set: ordNub :: (Ord a) = [a] - [a] ordNub l = go empty l where go _ [] = [] go s (x:xs) = if x `member` s then go s xs else x : go (insert x s) xs and put benchmarks on http://htmlpreview.github.io/?https://github.com/nh2/haskell-ordnub/blob/1f0a2c94a/report.html (compare `nub` vs `ordNub`). Richard Bird has a book, Pearls of Functional Algorithm Design that is meant to teach a form of deriving algorithms from the properties we ask of them. In this book, he gives a possible derivation of ordNub, simply called nub in the book, following the methodology he is teaching. He notes in the text that this derivation feels more complicated than it ought. Here is his version: http://lpaste.net/87625 I just sent you a pull request to add that one and S.toList . S.fromList that was suggested in this thread. I don't think those two implementations are faster than the others but it's nice to have them for completeness. Jason ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

In my tests, using unordered-containers was slightly slower than using Ord, although as the number of repeated elements grows unordered-containers appears to have an advantage. I'm sure the relative costs of comparison vs hashing would affect this also. But both are dramatically better than the current nub. Has anyone looked at Bart's patches to see how difficult it would be to apply them (or re-write them)? On Mon, Jul 15, 2013 at 8:43 PM, Clark Gaebel cgae...@uwaterloo.ca wrote: Apologies. I was being lazy. Here's a stable version: import qualified Data.HashSet as S hashNub :: (Ord a) = [a] - [a] hashNub l = go S.empty l where go _ [] = [] go s (x:xs) = if x `S.member` s then go s xs else x : go (S.insert x s) xs Which, again, will probably be faster than the one using Ord, and I can't think of any cases where I'd want the one using Ord instead. I may just not be creative enough, though. - Clark On Mon, Jul 15, 2013 at 12:46 AM, Brandon Allbery allber...@gmail.com wrote: On Sun, Jul 14, 2013 at 7:54 AM, Clark Gaebel cgae...@uwaterloo.ca wrote: Oops sorry I guess my point wasn't clear. Why ord based when hashable is faster? Then there's no reason this has to be in base, it can just be a Did the point about stable fly overhead? -- brandon s allbery kf8nh sine nomine associates allber...@gmail.com ballb...@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

On 16 July 2013 11:46, John Lato jwl...@gmail.com wrote: In my tests, using unordered-containers was slightly slower than using Ord, although as the number of repeated elements grows unordered-containers appears to have an advantage. I'm sure the relative costs of comparison vs hashing would affect this also. But both are dramatically better than the current nub. Has anyone looked at Bart's patches to see how difficult it would be to apply them (or re-write them)? If I understand correctly, this function is proposed to be added to Data.List which lives in base... but the proposals here are about using either Sets from containers or HashSet from unordered-containers; I thought base wasn't supposed to depend on any other package :/ On Mon, Jul 15, 2013 at 8:43 PM, Clark Gaebel cgae...@uwaterloo.ca wrote: Apologies. I was being lazy. Here's a stable version: import qualified Data.HashSet as S hashNub :: (Ord a) = [a] - [a] hashNub l = go S.empty l where go _ [] = [] go s (x:xs) = if x `S.member` s then go s xs else x : go (S.insert x s) xs Which, again, will probably be faster than the one using Ord, and I can't think of any cases where I'd want the one using Ord instead. I may just not be creative enough, though. - Clark On Mon, Jul 15, 2013 at 12:46 AM, Brandon Allbery allber...@gmail.com wrote: On Sun, Jul 14, 2013 at 7:54 AM, Clark Gaebel cgae...@uwaterloo.ca wrote: Oops sorry I guess my point wasn't clear. Why ord based when hashable is faster? Then there's no reason this has to be in base, it can just be a Did the point about stable fly overhead? -- brandon s allbery kf8nh sine nomine associates allber...@gmail.com ballb...@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe -- Ivan Lazar Miljenovic ivan.miljeno...@gmail.com http://IvanMiljenovic.wordpress.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

On Tue, Jul 16, 2013 at 10:31 AM, Ivan Lazar Miljenovic ivan.miljeno...@gmail.com wrote: On 16 July 2013 11:46, John Lato jwl...@gmail.com wrote: In my tests, using unordered-containers was slightly slower than using Ord, although as the number of repeated elements grows unordered-containers appears to have an advantage. I'm sure the relative costs of comparison vs hashing would affect this also. But both are dramatically better than the current nub. Has anyone looked at Bart's patches to see how difficult it would be to apply them (or re-write them)? If I understand correctly, this function is proposed to be added to Data.List which lives in base... but the proposals here are about using either Sets from containers or HashSet from unordered-containers; I thought base wasn't supposed to depend on any other package :/ That was one of the points up for discussion: is it worth including a subset of Set functionality to enable a much better nub in base? Is it even worth having Data.List.nub if it has quadratic complexity? As an alternative, Bart's proposal was for both including ordNub in containers and an improved nub (with no dependencies outside base) in Data.List. Unfortunately the patches are quite old (darcs format), and I don't know how they'd apply to the current situation. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

I'm procrastinating something else, so I wrote the patch to unordered-containers. Feel free to comment on the github link: https://github.com/tibbe/unordered-containers/pull/67 I'm still against having an Ord version, since my intuition tells me that hash-based data structures are faster than ordered ones. Someone else can write the patch, though! As a tangent, can anyone think of a data structure for which you can write an Ord instance but Hashable/Eq is impossible (or prove otherwise)? How about the converse? Regards, - Clark On Mon, Jul 15, 2013 at 10:40 PM, John Lato jwl...@gmail.com wrote: On Tue, Jul 16, 2013 at 10:31 AM, Ivan Lazar Miljenovic ivan.miljeno...@gmail.com wrote: On 16 July 2013 11:46, John Lato jwl...@gmail.com wrote: In my tests, using unordered-containers was slightly slower than using Ord, although as the number of repeated elements grows unordered-containers appears to have an advantage. I'm sure the relative costs of comparison vs hashing would affect this also. But both are dramatically better than the current nub. Has anyone looked at Bart's patches to see how difficult it would be to apply them (or re-write them)? If I understand correctly, this function is proposed to be added to Data.List which lives in base... but the proposals here are about using either Sets from containers or HashSet from unordered-containers; I thought base wasn't supposed to depend on any other package :/ That was one of the points up for discussion: is it worth including a subset of Set functionality to enable a much better nub in base? Is it even worth having Data.List.nub if it has quadratic complexity? As an alternative, Bart's proposal was for both including ordNub in containers and an improved nub (with no dependencies outside base) in Data.List. Unfortunately the patches are quite old (darcs format), and I don't know how they'd apply to the current situation. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

On Mon, Jul 15, 2013 at 10:31 PM, Ivan Lazar Miljenovic ivan.miljeno...@gmail.com wrote: If I understand correctly, this function is proposed to be added to Data.List which lives in base... but the proposals here are about using either Sets from containers or HashSet from unordered-containers; I thought base wasn't supposed to depend on any other package :/ As I understand it, currently we have a double proposal: simple ordNub in Data.List without external dependencies, and the other one in containers and/or unordered-containers as appropriate. -- brandon s allbery kf8nh sine nomine associates allber...@gmail.com ballb...@sinenomine.net unix, openafs, kerberos, infrastructure, xmonadhttp://sinenomine.net ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

On 16/07/2013, at 3:21 PM, Clark Gaebel wrote: I'm still against having an Ord version, since my intuition tells me that hash-based data structures are faster than ordered ones. There are at least four different things that an Ord version might mean: - first sort a list, then eliminate duplicates - sort a list eliminating duplicates stably as you go (think 'merge sort', using 'union' instead of 'merge') - build a balanced tree set as you go - having a list that is already sorted, use that to eliminated duplicates cheaply. These things have different costs. For example, if there are N elements of which U are unique, the first as O(N.log N) cost, the third has O(N.log U) cost, and the fourth has O(N) cost. What I want is more often ordNubBy than ordNub, though. Someone else can write the patch, though! As a tangent, can anyone think of a data structure for which you can write an Ord instance but Hashable/Eq is impossible (or prove otherwise)? How about the converse? Since Ord has Eq as a superclass, and since 0 is a functionally correct hash value for anything, if you can implement Ord you can obviously implement Hashable/Eq. Whether it is *useful* to do so is another question. It turns out that it _is_ possible to define good quality hash functions on sets, but most code in the field to do so is pretty bad. (Just a modular sum or exclusive or.) ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

nubBy is a very good suggestion. Added! Regarding good hash functions: if your data structure is algebraic, you can derive generic and Hashable will give you a pretty good hash function: data ADT a = C0 Int String | C1 [a] deriving Generic instance Hashable a = Hashable (ADT a) It's magic! - Clark On Mon, Jul 15, 2013 at 11:35 PM, Richard A. O'Keefe o...@cs.otago.ac.nz wrote: On 16/07/2013, at 3:21 PM, Clark Gaebel wrote: I'm still against having an Ord version, since my intuition tells me that hash-based data structures are faster than ordered ones. There are at least four different things that an Ord version might mean: - first sort a list, then eliminate duplicates - sort a list eliminating duplicates stably as you go (think 'merge sort', using 'union' instead of 'merge') - build a balanced tree set as you go - having a list that is already sorted, use that to eliminated duplicates cheaply. These things have different costs. For example, if there are N elements of which U are unique, the first as O(N.log N) cost, the third has O(N.log U) cost, and the fourth has O(N) cost. What I want is more often ordNubBy than ordNub, though. Someone else can write the patch, though! As a tangent, can anyone think of a data structure for which you can write an Ord instance but Hashable/Eq is impossible (or prove otherwise)? How about the converse? Since Ord has Eq as a superclass, and since 0 is a functionally correct hash value for anything, if you can implement Ord you can obviously implement Hashable/Eq. Whether it is *useful* to do so is another question. It turns out that it _is_ possible to define good quality hash functions on sets, but most code in the field to do so is pretty bad. (Just a modular sum or exclusive or.) ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

On 16 July 2013 10:31, Ivan Lazar Miljenovic ivan.miljeno...@gmail.com wrote: On 16 July 2013 11:46, John Lato jwl...@gmail.com wrote: In my tests, using unordered-containers was slightly slower than using Ord, although as the number of repeated elements grows unordered-containers appears to have an advantage. I'm sure the relative costs of comparison vs hashing would affect this also. But both are dramatically better than the current nub. Has anyone looked at Bart's patches to see how difficult it would be to apply them (or re-write them)? If I understand correctly, this function is proposed to be added to Data.List which lives in base... but the proposals here are about using either Sets from containers or HashSet from unordered-containers; I thought base wasn't supposed to depend on any other package :/ This discussion (on -cafe@) is just about what course of action to take; adding such functions to containers or unordered-containers would not require a libraries@ proposal. Conrad. On Mon, Jul 15, 2013 at 8:43 PM, Clark Gaebel cgae...@uwaterloo.ca wrote: Apologies. I was being lazy. Here's a stable version: import qualified Data.HashSet as S hashNub :: (Ord a) = [a] - [a] hashNub l = go S.empty l where go _ [] = [] go s (x:xs) = if x `S.member` s then go s xs else x : go (S.insert x s) xs Which, again, will probably be faster than the one using Ord, and I can't think of any cases where I'd want the one using Ord instead. I may just not be creative enough, though. - Clark On Mon, Jul 15, 2013 at 12:46 AM, Brandon Allbery allber...@gmail.com wrote: On Sun, Jul 14, 2013 at 7:54 AM, Clark Gaebel cgae...@uwaterloo.ca wrote: Oops sorry I guess my point wasn't clear. Why ord based when hashable is faster? Then there's no reason this has to be in base, it can just be a Did the point about stable fly overhead? -- brandon s allbery kf8nh sine nomine associates allber...@gmail.com ballb...@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe -- Ivan Lazar Miljenovic ivan.miljeno...@gmail.com http://IvanMiljenovic.wordpress.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

Similarly, I've always used: import qualified Data.HashSet as S nub :: Hashable a = [a] - [a] nub = S.toList . S.fromList And i can't think of any type which i can't write a Hashable instance, so this is extremely practical. On Jul 14, 2013 7:24 AM, Niklas Hambüchen m...@nh2.me wrote: tldr: nub is abnormally slow, we shouldn't use it, but we do. As you might know, Data.List.nub is O(n²). (*) As you might not know, almost *all* practical Haskell projects use it, and that in places where an Ord instance is given, e.g. happy, Xmonad, ghc-mod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600 more (see https://github.com/nh2/haskell-ordnub). I've taken the Ord-based O(n * log n) implementation from yi using a Set: ordNub :: (Ord a) = [a] - [a] ordNub l = go empty l where go _ [] = [] go s (x:xs) = if x `member` s then go s xs else x : go (insert x s) xs and put benchmarks on http://htmlpreview.github.io/?https://github.com/nh2/haskell-ordnub/blob/1f0a2c94a/report.html (compare `nub` vs `ordNub`). `ordNub` is not only in a different complexity class, but even seems to perform better than nub for very small numbers of actually different list elements (that's the numbers before the benchmark names). (The benchmark also shows some other potential problem: Using a state monad to keep the set instead of a function argument can be up to 20 times slower. Should that happen?) What do you think about ordNub? I've seen a proposal from 5 years ago about adding a *sort*Nub function started by Neil, but it just died. (*) The mentioned complexity is for the (very common) worst case, in which the number of different elements in the list grows with the list (alias you don't have an N element list with always only 5 different things inside). ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

Oops sorry I guess my point wasn't clear. Why ord based when hashable is faster? Then there's no reason this has to be in base, it can just be a free function in Data.HashSet. If stability is a concern then there's a way to easily account for that using HashMap. - Clark On Jul 14, 2013 7:48 AM, Niklas Hambüchen m...@nh2.me wrote: One of my main points is: Should we not add such a function (ord-based, same output as nub, stable, no sorting) to base? As the package counting shows, if we don't offer an alternative, people obviously use it, and not to our benefit. (Not to say it this way: We could make the Haskell world fast with smarter fusion, strictness analysis and LLVM backends. Or we could stop using quadratic algorithms.) ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

One of my main points is: Should we not add such a function (ord-based, same output as nub, stable, no sorting) to base? As the package counting shows, if we don't offer an alternative, people obviously use it, and not to our benefit. (Not to say it this way: We could make the Haskell world fast with smarter fusion, strictness analysis and LLVM backends. Or we could stop using quadratic algorithms.) ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

Something like that should definitely be included in Data.List. Thanks for working on it. Roman * Niklas Hambüchen m...@nh2.me [2013-07-14 19:20:52+0800] tldr: nub is abnormally slow, we shouldn't use it, but we do. As you might know, Data.List.nub is O(n²). (*) As you might not know, almost *all* practical Haskell projects use it, and that in places where an Ord instance is given, e.g. happy, Xmonad, ghc-mod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600 more (see https://github.com/nh2/haskell-ordnub). I've taken the Ord-based O(n * log n) implementation from yi using a Set: ordNub :: (Ord a) = [a] - [a] ordNub l = go empty l where go _ [] = [] go s (x:xs) = if x `member` s then go s xs else x : go (insert x s) xs and put benchmarks on http://htmlpreview.github.io/?https://github.com/nh2/haskell-ordnub/blob/1f0a2c94a/report.html (compare `nub` vs `ordNub`). `ordNub` is not only in a different complexity class, but even seems to perform better than nub for very small numbers of actually different list elements (that's the numbers before the benchmark names). (The benchmark also shows some other potential problem: Using a state monad to keep the set instead of a function argument can be up to 20 times slower. Should that happen?) What do you think about ordNub? I've seen a proposal from 5 years ago about adding a *sort*Nub function started by Neil, but it just died. (*) The mentioned complexity is for the (very common) worst case, in which the number of different elements in the list grows with the list (alias you don't have an N element list with always only 5 different things inside). ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

At Sun, 14 Jul 2013 07:31:05 -0400, Clark Gaebel wrote: Similarly, I've always used: import qualified Data.HashSet as S nub :: Hashable a = [a] - [a] nub = S.toList . S.fromList And i can't think of any type which i can't write a Hashable instance, so this is extremely practical. Well, the above is not stable while Niklas’ is. But I guess that’s not the point of your message :). I’ve always avoided “nub” too, and FWIW I’d like a constrained version too—maybe avoiding Data.Set so that it could live in Data.List. I think Ord would be much better than Hashable, since it is 1. in “base” 2. much more established and understood. Although if you find yourself using “nub” too much you’re probably doing something wrong... Francesco ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

On Sun, Jul 14, 2013 at 7:31 AM, Clark Gaebel cgae...@uwaterloo.ca wrote: Similarly, I've always used: import qualified Data.HashSet as S nub :: Hashable a = [a] - [a] nub = S.toList . S.fromList And i can't think of any type which i can't write a Hashable instance, so this is extremely practical. This won't yield results lazily (e.g. nub (repeat 'x') = _|_ instead of 'x' : _|_), but Niklas' ordNub will. His ordNub can be translated directly to HashSet and still have the stability and laziness properties. A difficulty with putting ordNub in Data.List is that it depends on containers, which is outside of the base package. Some options: * Move the implementation of Set to base. * Implement a lean version of Set in base that only provides 'insert' and 'member'. * Define ordNub in Data.Set instead. Adding a Hashable-based nub to base would be even more problematic, since you'd need Hashable in base. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

On 15 July 2013 09:54, Joey Adams joeyadams3.14...@gmail.com wrote: On Sun, Jul 14, 2013 at 7:31 AM, Clark Gaebel cgae...@uwaterloo.ca wrote: Similarly, I've always used: import qualified Data.HashSet as S nub :: Hashable a = [a] - [a] nub = S.toList . S.fromList And i can't think of any type which i can't write a Hashable instance, so this is extremely practical. This won't yield results lazily (e.g. nub (repeat 'x') = _|_ instead of 'x' : _|_), but Niklas' ordNub will. His ordNub can be translated directly to HashSet and still have the stability and laziness properties. A difficulty with putting ordNub in Data.List is that it depends on containers, which is outside of the base package. Some options: * Move the implementation of Set to base. * Implement a lean version of Set in base that only provides 'insert' and 'member'. * Define ordNub in Data.Set instead. Adding a Hashable-based nub to base would be even more problematic, since you'd need Hashable in base. Right, I suggest the following community course of action: 1a) add ordNub to Data.Set 1b) add ordNub to Data.Hashable (1 day) 2) make a libraries@ proposal to include a stripped-down Data.Set-like balanced binary tree implementation to base. (2 weeks) 3) bikeshed about the name, eg.: * is nub really intuitive? how about uniq, like in perl/ruby/underscore.js? * but uniq in unix only removes _adjacent_ duplicates, confusing! * how about distinct? sole? unique? azygous? (7 weeks) 4) Failing consensus on technical grounds (that the stripped-down Data.Set implementation is overkill for one library function), agree that anyone who really cares should just use the version from containers or hashable. Only newbs and textbook authors actually use base anyway, and it's impossible to change the language definition. Prelude will continue to fulfil its role of avoiding success at all costs, quadratic or otherwise. (Please, let's have both 1a and 1b :) Conrad. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

### Re: [Haskell-cafe] ordNub

Just so people are aware - five years ago the notion of nubOrd and nubWith was discussed and a consensus reached on including nubOrd. I think Bart got too busy, didn't submit a final patch, and no one with commit access actually commited any code. http://haskell.1045720.n5.nabble.com/GHC-2717-Add-nubWith-nubOrd-td3159919.html I fully support an efficient nub implementation making its way into base - it's far past time. Using Set seems sensible. Cheers, Thomas On Sun, Jul 14, 2013 at 4:20 AM, Niklas Hambüchen m...@nh2.me wrote: tldr: nub is abnormally slow, we shouldn't use it, but we do. As you might know, Data.List.nub is O(n²). (*) As you might not know, almost *all* practical Haskell projects use it, and that in places where an Ord instance is given, e.g. happy, Xmonad, ghc-mod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600 more (see https://github.com/nh2/haskell-ordnub). I've taken the Ord-based O(n * log n) implementation from yi using a Set: ordNub :: (Ord a) = [a] - [a] ordNub l = go empty l where go _ [] = [] go s (x:xs) = if x `member` s then go s xs else x : go (insert x s) xs and put benchmarks on http://htmlpreview.github.io/?https://github.com/nh2/haskell-ordnub/blob/1f0a2c94a/report.html (compare `nub` vs `ordNub`). `ordNub` is not only in a different complexity class, but even seems to perform better than nub for very small numbers of actually different list elements (that's the numbers before the benchmark names). (The benchmark also shows some other potential problem: Using a state monad to keep the set instead of a function argument can be up to 20 times slower. Should that happen?) What do you think about ordNub? I've seen a proposal from 5 years ago about adding a *sort*Nub function started by Neil, but it just died. (*) The mentioned complexity is for the (very common) worst case, in which the number of different elements in the list grows with the list (alias you don't have an N element list with always only 5 different things inside). ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe