[Haskell-cafe] Re: Optimizing spelling correction program
Using Bryan O'Sullivan's fantastic BloomFilter I got it down below Python's run time! Now it is 35.56s; 28% of the time is spent on GC, which I think means there is still some room for improvement.

One easy way to reduce the GC time is to increase the allocation area size from its default:

./a.out +RTS -A200M

That does bring GC down to 1.4% of the run time, but it increases the overall run time by 14s.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] Re: Optimizing spelling correction program
On Jun 22, 10:12 pm, Bulat Ziganshin bulat.zigans...@gmail.com wrote:
> Hello Kamil,
>
> Tuesday, June 23, 2009, 12:54:49 AM, you wrote:
>> I went back to using Strings instead of ByteStrings and with that
>> hashtable the program finishes in 31.5s! w00t!
>
> and GC times are? also, try ByteString+HT, it should be pretty easy
> to write hashByteString

GC time is 10%. I'll try ByteString+HT tonight. It might indeed be faster, because BloomFilter+String is slower than BloomFilter+ByteString.
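For readers who want to follow Bulat's suggestion, a hashByteString really is short to write. This is only a hypothetical sketch, using a djb2-style fold over the bytes rather than whatever Data.HashTable's hashString actually did:

```haskell
import qualified Data.ByteString.Char8 as B

-- A possible hashByteString in the spirit of hashString:
-- a djb2-style strict fold over the characters.
hashByteString :: B.ByteString -> Int
hashByteString = B.foldl' (\h c -> 33 * h + fromEnum c) 5381

main :: IO ()
main = do
  print (hashByteString (B.pack "two"))
  print (hashByteString (B.pack "two") == hashByteString (B.pack "two"))  -- True
```

B.foldl' is strict in the accumulator, so hashing never builds thunks even for long keys.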
[Haskell-cafe] Re: Optimizing spelling correction program
On Jun 22, 10:03 am, Eugene Kirpichov ekirpic...@gmail.com wrote:
> Hey, you're using String I/O!
>
>     WORDS <- fmap (train . map B.pack . words) (readFile "big.txt")
>
> This should be
>
>     WORDS <- fmap (train . B.words) (B.readFile "big.txt")
>
> By the way, which exact file do you use as a misspellings file? The
> corpus linked to at Norvig's page has many. And do you have a driver
> program that I could run and obtain your timings?

Yep, Don pointed that out and I have changed the program accordingly. It didn't make any difference, though; the time spent on building the dictionary is a small portion of the overall run time. Please see the repository for the current version of the program:

http://patch-tag.com/r/spellcorrect/snapshot/current/content/pretty

The eval-bytestring.hs there is the program I used for timing. Inside it you will find the name of the misspellings file needed.

Thanks all for the suggestions. I'll try them when I get home tonight.

-- Kamil Dworakowski
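For context, the dictionary-building step Eugene is pointing at can be sketched like this. This is a hypothetical reconstruction, not the exact train from the repository, and it uses an inline corpus instead of reading "big.txt" so that it is self-contained:

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.Map as M
import Data.List (foldl')

-- Build a word-frequency map, staying in ByteString all the way
-- instead of splitting a String and packing each word afterwards.
train :: [B.ByteString] -> M.Map B.ByteString Int
train = foldl' (\m w -> M.insertWith (+) w 1 m) M.empty

main :: IO ()
main = do
  -- in the real program: ws <- fmap B.words (B.readFile "big.txt")
  let ws = B.words (B.pack "the quick brown fox jumps over the lazy dog the")
  print (M.lookup (B.pack "the") (train ws))  -- Just 3
```

The point of B.words over map B.pack . words is that the input is never materialised as a list of boxed Chars.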
[Haskell-cafe] Re: Optimizing spelling correction program
On Jun 22, 6:46 am, Bulat Ziganshin bulat.zigans...@gmail.com wrote:
> Hello Kamil,
>
> Monday, June 22, 2009, 12:01:40 AM, you wrote:
>> Right... Python uses hashtables while here I have a tree with log n
>
> you can try this pure hashtable approach:
>
> import Prelude hiding (lookup)
> import qualified Data.HashTable
> import Data.Array
> import qualified Data.List as List
>
> data HT a b = HT (a->Int) (Array Int [(a,b)])
>
> -- size is the size of array (we implement closed hash)
> -- hash is the hash function (a->Int)
> -- list is assoclist of items to put in hash
> create size hash list = HT hashfunc
>                            (accumArray (flip (:)) [] (0, arrsize-1)
>                                        (map (\(a,b) -> (hashfunc a, b)) list))
>   where arrsize    = head $ filter (>size) $ iterate (\x -> 3*x+1) 1
>         hashfunc a = hash a `mod` arrsize
>
> lookup a (HT hash arr) = List.lookup a (arr ! hash a)
>
> main = do let assoclist = [("one", 1), ("two", 2), ("three", 3)]
>               hash = create 10 (fromEnum . Data.HashTable.hashString) assoclist
>           print (lookup "one" hash)
>           print (lookup "zero" hash)

It does not compile:

    No instance for (Num (String, b))
      arising from the literal `3' at foo.hs:23:61
    Possible fix: add an instance declaration for (Num (String, b))
    In the expression: 3
    In the expression: ("three", 3)
    In the expression: [("one", 1), ("two", 2), ("three", 3)]
[Haskell-cafe] Re: Optimizing spelling correction program
On Jun 22, 9:10 am, Ketil Malde ke...@malde.org wrote:
> Kamil Dworakowski ka...@dworakowski.name writes:
>> Right... Python uses hashtables while here I have a tree with log n
>> access time. I did not want to use the Data.HashTable, it would
>> pervade my program with IO. The alternative is an ideal hashmap that
>> never gets changed. This program creates a dictionary at start which
>> then is only being used to read from: an ideal application for the
>> Data.PerfectHash by Mark Wotton available on Hackage [3].
>
> If you are considering alternative data structures, you might want to
> look at tries or Bloom filters; both have O(n) lookup, and both have
> Haskell implementations. The latter is probably faster but
> probabilistic (i.e. it will occasionally fail to detect a
> misspelling, which you can of course check against a real
> dictionary).

Using Bryan O'Sullivan's fantastic BloomFilter I got it down below Python's run time! Now it is 35.56s; 28% of the time is spent on GC, which I think means there is still some room for improvement.

Are you saying that PerfectHash runs with the penalty of calling into a C library?
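To illustrate the probabilistic membership idea Ketil describes, here is a toy Bloom filter. This is a self-contained sketch, not Bryan O'Sullivan's bloomfilter package: the bit count and the three seeded hash functions are arbitrary choices for the example.

```haskell
import Data.Bits (setBit, testBit)
import Data.Char (ord)
import Data.List (foldl')

-- A toy Bloom filter: an m-bit set (backed here by an Integer) plus
-- k hash functions. Lookups can yield false positives, but a word
-- that was inserted is always found (no false negatives).
bits :: Int
bits = 8192

hashes :: String -> [Int]
hashes w = [ h seed `mod` bits | seed <- [5381, 33, 7919] ]
  where h seed = foldl' (\acc c -> acc * seed + ord c) 0 w

insert :: Integer -> String -> Integer
insert b w = foldl' setBit b (hashes w)

member :: Integer -> String -> Bool
member b w = all (testBit b) (hashes w)

main :: IO ()
main = do
  let dict = foldl' insert 0 ["speling", "korrect", "haskell"]
  print (member dict "haskell")  -- True: it was inserted
  print (member dict "python")   -- almost surely False, but a
                                 -- collision could say True
```

In the spell checker the filter would hold the known-words dictionary, with positives double-checked against the real word list if exactness matters.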
[Haskell-cafe] Re: Optimizing spelling correction program
On Jun 22, 9:06 pm, Daniel Fischer daniel.is.fisc...@web.de wrote:
> Am Montag 22 Juni 2009 21:31:50 schrieb Kamil Dworakowski:
>> On Jun 22, 6:46 am, Bulat Ziganshin bulat.zigans...@gmail.com wrote:
>>> you can try this pure hashtable approach:
>>> [...]
>>> create size hash list = HT hashfunc
>>>     (accumArray (flip (:)) [] (0, arrsize-1)
>>>                 (map (\(a,b) -> (hashfunc a, b)) list))
>>> [...]
>
> Typo: should be map (\(a,b) -> (hashfunc a, (a,b)))

Wait! Have you typed that definition into the msg off the top of your head? :)

I went back to using Strings instead of ByteStrings and with that hashtable the program finishes in 31.5s! w00t!
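Putting Bulat's code and Daniel's fix together, a self-contained version might look like this. Data.HashTable.hashString came from the old base library, so it is replaced here by a hand-rolled string hash; both that and the type signatures are additions:

```haskell
import Prelude hiding (lookup)
import Data.Array (Array, accumArray, (!))
import Data.Char (ord)
import qualified Data.List as List

-- Immutable closed hash: an array of buckets, each an assoc list.
data HT a b = HT (a -> Int) (Array Int [(a, b)])

-- size: lower bound on the array size; hash: the hash function;
-- list: assoc list of items to store. Note Daniel's fix: each
-- bucket must hold the whole (key, value) pair, not just the value.
create :: Int -> (a -> Int) -> [(a, b)] -> HT a b
create size hash list = HT hashfunc
    (accumArray (flip (:)) [] (0, arrsize - 1)
                (map (\(a, b) -> (hashfunc a, (a, b))) list))
  where arrsize    = head $ filter (> size) $ iterate (\x -> 3 * x + 1) 1
        hashfunc a = hash a `mod` arrsize

lookup :: Eq a => a -> HT a b -> Maybe b
lookup a (HT hash arr) = List.lookup a (arr ! hash a)

-- Stand-in for the old Data.HashTable.hashString.
hashString :: String -> Int
hashString = List.foldl' (\h c -> 31 * h + ord c) 0

main :: IO ()
main = do
  let assoclist = [("one", 1), ("two", 2), ("three", 3)]
      ht = create 10 hashString assoclist
  print (lookup "one" ht)   -- Just 1
  print (lookup "zero" ht)  -- Nothing
</pre>
```

Because the table is built once with accumArray and never mutated, lookups stay pure, which was the whole point of avoiding Data.HashTable's IO.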
[Haskell-cafe] Optimizing spelling correction program
Hi all,

I want to learn how to optimize Haskell code using Norvig's spelling correction program [1]. Peter Norvig wrote the program in Python; a quite literate translation to Haskell (thanks to Grzegorz Chrupala) [2] was also available. Both versions are about 23 LOCs, but the Haskell version is four times slower: 2m25s versus 36s. All the numbers I give come from running the program on a list of 806 misspellings, with the Haskell version compiled with the -O2 and -fforce-recomp flags.

I started with a trial and error approach. Just preventing a repetitive call to keysSet brought the running time down to 1m48s. I also restructured the edits1 function and dropped all uses of sets, which hadn't been pulling their weight. This brought the running time down to 55s. Not quite there yet, but close.

Reading the Profiling chapter in the RWH book enabled me to do some more informed optimizing. At this point the GC time was about 2% of the overall running time; Grzegorz had done a good job of using strict versions of functions where necessary. The timing per expression, however, showed something useful: 70% of the time was being spent determining membership in the known-words dictionary (Data.Map). There are about 30,000 items in the dictionary.

Right... Python uses hashtables, while here I have a tree with log n access time. I did not want to use Data.HashTable, as it would pervade my program with IO. The alternative is an ideal hashmap that never gets changed. This program creates a dictionary at start which is then only read from: an ideal application for Data.PerfectHash by Mark Wotton, available on Hackage [3].

PerfectHash is a dictionary keyed by ByteStrings, so I had to migrate the program to use these instead of Strings. Sadly, that alone increased the running time to 59s. Adding PerfectHash then brought the running time down to 39s. I have done some code restructuring with a view to publishing it, which brought the running time up to 45s.

The code is available on patch-tag [4] for anyone to scrutinize and hopefully point out mistakes (it uses the big.txt and misspellings files which are linked to from Norvig's page [1]).

At this point [5] I don't know how to proceed. The program is still slower than Python. The -ddump-simpl dump does not give me any clues; there is too much of it and I don't understand most of it. The GC time is about 10% of the total, and according to the profiler almost half the time is spent in the member function and almost all the rest in edits1. Shouldn't it run much faster than the interpreted Python?

Frankly, I am not impressed with the PerfectHash performance; it looks as if it is only two times faster than Data.Map. Is this because of the number of elements?

1. http://www.norvig.com/spell-correct.html
2. http://pitekus.blogspot.com/2007/04/norvigs-spelling-corrector-in-haskell.html
3. http://hackage.haskell.org/package/PerfectHash
4. http://patch-tag.com/r/spellcorrect/
5. http://patch-tag.com/r/spellcorrect/snapshot/hash/20090621192727-a99e5-fed48a7ccf07b79c572fddb0304648fe80d39904/content/pretty/SpellingCorrection.hs
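For readers who want to experiment, the hot edits1 function can be sketched as below. This is a hypothetical reconstruction following Norvig's essay, not the exact code from the repository:

```haskell
import Data.List (nub)

alphabet :: String
alphabet = ['a'..'z']

-- All strings at edit distance 1 from w: deletions, transpositions,
-- replacements and insertions, as in Norvig's spell-correct essay.
edits1 :: String -> [String]
edits1 w = nub (deletes ++ transposes ++ replaces ++ inserts)
  where
    splits     = [ splitAt i w | i <- [0 .. length w] ]
    deletes    = [ a ++ bs        | (a, _:bs)       <- splits ]
    transposes = [ a ++ c2:c1:bs  | (a, c1:c2:bs)   <- splits ]
    replaces   = [ a ++ c:bs      | (a, _:bs)       <- splits, c <- alphabet ]
    inserts    = [ a ++ c:b       | (a, b)          <- splits, c <- alphabet ]

main :: IO ()
main = print (length (edits1 "something"))  -- 494, per Norvig's essay
```

For a word of length n this generates 54n+25 candidates before deduplication, which is why membership tests against the dictionary dominate the profile: every candidate gets looked up.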
[Haskell-cafe] Re: Optimizing spelling correction program
> What are the keys in your Map? Strict ByteStrings? And you're using
> ghc 6.10.x?

For the record, I use 6.10.1 and strict ByteStrings everywhere now. I used to have some lazy IO with vanilla Strings, but switching to Data.ByteString.Char8.readFile didn't change the time at all. The big.txt is about 6 MB. The profiling output says there are about 100,000,000 entries for the lookup function.

http://patch-tag.com/r/spellcorrect/snapshot/current/content/pretty
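The String-to-ByteString switch discussed above boils down to which readFile you call. A small self-contained sketch (the corpus file and its contents are made up for the example):

```haskell
import qualified Data.ByteString.Char8 as B

main :: IO ()
main = do
  -- write a tiny stand-in corpus so the example is runnable
  writeFile "corpus.txt" "the quick brown fox"

  -- lazy String I/O: contents arrive on demand as a list of Chars
  s <- readFile "corpus.txt"
  print (length (words s))    -- 4

  -- strict ByteString I/O: one read, contents fully in memory
  b <- B.readFile "corpus.txt"
  print (length (B.words b))  -- 4
```

The results are identical; the difference is representation. A String is a linked list of boxed Chars, while a strict ByteString is one packed buffer, which matters once lookups run 100,000,000 times over keys built from it.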