Re: [Haskell-cafe] Sharing on equality

2011-12-19 Thread wren ng thornton

On 12/13/11 10:52 AM, Johan Brinch wrote:

Hey all,

Can GHC eliminate one of two equal ByteStrings, when they are compared
and turns out to be equal?


Say i have a map, ByteString -  Int.

I now do a lookup on a ByteString and if it exists, I insert this
ByteString into a list.

Is it possible to avoid using more memory, than used by the keys in
the map + the list structure?

I.e. is it possible to eliminate the redundant ByteStrings somehow?


Somehow? yes. Probably the easiest route is just to use:

-- Or better yet, use one of the HashMap structures
type MyMap a = Map ByteString (ByteString,a)

myInsert :: ByteString - a - MyMap a - MyMap a
myInsert k v = insert k (k,v)

myLookup :: ByteString - MyMap a - Maybe a
myLookup k = fmap snd . lookup k

...

If you really care a lot about memory overhead and sharing, there are 
two other approaches which are more work but have nice payoffs.


The first option is to use a trie structure which automatically prunes 
the key segments and allows you to reconstruct the keys. The 
bytestring-trie package does this. This approach has its own set of 
costs and benefits compared to plain mapping structures, so whether you 
want to go down this road will depend on what functionality you need. If 
you don't actually need the trie functionality (e.g., the ability to get 
the subtrie of all keys with a given prefix, or the ability to look up 
the values for all prefixes of a given key), then you're probably better 
off using a hashtable-based solution, like the hashmap or 
unordered-containers packages.


The second option is to use interning in order to ensure uniqueness of 
your expensive structures. That is, you implement the interface:


type InternId
-- e.g., newtype InternId = IId Int

instance Eq InternId
-- this should be faster than (Eq a)

-- you should also have instances for use in unboxed arrays, etc

type InternTable a
-- e.g., newtype InternTable a = IT (IntMap a, Map a InternId)

intern :: a - InternTable a - (InternTable a, InternId)

unintern :: InternId - InternTable a - a

...

And then you pass around the InternIds instead of the actual strings. 
This ensures low memory overhead for passing around and duplicating 
things, ensures fast equality comparisons, and ensures uniqueness 
whenever you need to deal with the actual structure instead of an 
identifier for it. Of course, you can generalize this idea to any data 
structure, not just strings; just google for hash consing.


I have a decently tuned implementation of ByteString interning laying 
around, which I'm hoping to put on Hackage before classes start up again 
in January. Of course, for extremely fine tuning we'd want to adjust the 
size of the InternId representation based on a heuristic upper-bound on 
the number of items we'll have, so that we can pack the unboxed arrays 
more tightly or do tricks like packing four 8-bit ids into a Word32. Of 
course, presenting a nice API for that would require using associated 
types which is GHC-only (the fundep version wouldn't be quite so 
pretty). I'll probably do that in the future once I locate some round tuits.


--
Live well,
~wren

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Sharing on equality

2011-12-14 Thread Ketil Malde
Johan Brinch brin...@gmail.com writes:

 Can GHC eliminate one of two equal ByteStrings, when they are compared
 and turns out to be equal?

Not in general, there is no guarantee that a is identical to b, just
because a == b.

 Say i have a map, ByteString - Int.

  Data.Map.Map ByteString Int

 I now do a lookup on a ByteString and if it exists, I insert this
 ByteString into a list.

 Is it possible to avoid using more memory, than used by the keys in
 the map + the list structure?

 I guess, this could be done by having lookup return the key as well,
 and then insert this key into the list, however, that's a bit ugly and
 somewhat anti-intuitive.

I think this /is/ intuitive.  (Or you could replace the key in the map,
but that will still leave duplicates in the list).

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Sharing on equality

2011-12-13 Thread Johan Brinch
Hey all,

Can GHC eliminate one of two equal ByteStrings, when they are compared
and turns out to be equal?


Say i have a map, ByteString - Int.

I now do a lookup on a ByteString and if it exists, I insert this
ByteString into a list.

Is it possible to avoid using more memory, than used by the keys in
the map + the list structure?

I.e. is it possible to eliminate the redundant ByteStrings somehow?

I guess, this could be done by having lookup return the key as well,
and then insert this key into the list, however, that's a bit ugly and
somewhat anti-intuitive.


Here's an example program, to illustrate the idea:
https://gist.github.com/1472418

-- 
Johan Brinch

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe