[Haskell-cafe] Re: Optimizing spelling correction program

2009-06-23 Thread Kamil Dworakowski
  Using Bryan O'Sullivan's fantastic BloomFilter I got it down below
  Python's run time! Now it is 35.56s, 28% of the time is spent on GC,
  which I think means there is still some room for improvement.

 One easy way to fix the GC time is to increase the default heap size.

  ./a.out +RTS -A200M

It does make the GC only 1.4% of run time but it  increases it overall
by 14s.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Re: Optimizing spelling correction program

2009-06-23 Thread Kamil Dworakowski
On Jun 22, 10:12 pm, Bulat Ziganshin bulat.zigans...@gmail.com
wrote:
 Hello Kamil,

 Tuesday, June 23, 2009, 12:54:49 AM, you wrote:

  I went back to using Strings instead of ByteStrings and with that
  hashtable the program finishes in 31.5s! w00t!

 and GC times are? also, try ByteString+HT, it should be pretty easy to
 write hashByteString

GC time is 10%. I'll try ByteString+HT tonight. It might indeed be
faster because BloomFilter+String is slower that BloomFilter
+ByteString.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Re: Optimizing spelling correction program

2009-06-22 Thread Kamil Dworakowski
On Jun 22, 10:03 am, Eugene Kirpichov ekirpic...@gmail.com wrote:
 Hey, you're using String I/O!

 nWORDS - fmap (train . map B.pack . words) (readFile big.txt)

 This should be

 WORDS - fmap (train . B.words) (B.readFile big.txt)

 By the way, which exact file do you use as a misspellings file? The
 corpus linked to at Norvig's page has many.
 And do you have a driver program that I could run and obtain your timings?

Yep, Don pointed that out and I have changed the program accordingly.
It didn't make any difference though. The time spent on building the
dictionary is a small portion of the overall run time.

Please see the repository contents for the current version of the
program:
http://patch-tag.com/r/spellcorrect/snapshot/current/content/pretty

The eval-bytestring.hs there is the program I used for timing. Inside
of it you will find the name of the misspellings file needed.

Thanks all for the suggestions. I'll try them when I get home tonight.

--
Kamil Dworakowski
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Re: Optimizing spelling correction program

2009-06-22 Thread Kamil Dworakowski

On Jun 22, 6:46 am, Bulat Ziganshin bulat.zigans...@gmail.com wrote:
 Hello Kamil,

 Monday, June 22, 2009, 12:01:40 AM, you wrote:

  Right... Python uses hashtables while here I have a tree with log n

 you can try this pure hashtable approach:

 import Prelude hiding (lookup)
 import qualified Data.HashTable
 import Data.Array
 import qualified Data.List as List

 data HT a b = HT (a-Int) (Array Int [(a,b)])

 -- size is the size of array (we implent closed hash)
 -- hash is the hash function (a-Int)
 -- list is assoclist of items to put in hash
 create size hash list = HT hashfunc
                            (accumArray (flip (:))
                                        []
                                        (0, arrsize-1)
                                        (map (\(a,b) - (hashfunc a,b)) list)
                            )

   where arrsize     =  head$ filter (size)$ iterate (\x-3*x+1) 1
         hashfunc a  =  hash a `mod` arrsize

 lookup a (HT hash arr) = List.lookup a (arr!hash a)

 main = do let assoclist = [(one, 1), (two, 2), (three, 3)]
               hash = create 10 (fromEnum . Data.HashTable.hashString) 
 assoclist
           print (lookup one hash)
           print (lookup zero hash)

It does not compile:

No instance for (Num (String, b))
  arising from the literal `3' at foo.hs:23:61
Possible fix: add an instance declaration for (Num (String, b))
In the expression: 3
In the expression: (three, 3)
In the expression: [(one, 1), (two, 2), (three, 3)]


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Re: Optimizing spelling correction program

2009-06-22 Thread Kamil Dworakowski
On Jun 22, 9:10 am, Ketil Malde ke...@malde.org wrote:
 Kamil Dworakowski ka...@dworakowski.name writes:
  Right... Python uses hashtables while here I have a tree with log n
  access time. I did not want to use the Data.HashTable, it would
  pervade my program with IO. The alternative is an ideal hashmap that never
  gets changed. This program creates a dictionary at start which then is only
  being used to read from: an ideal application for the Data.PerfectHash
  by Mark Wotton available on Hackage [3].

 If you are considering alternative data structures, you might want to
 look at tries or Bloom filters, both have O(n) lookup, both have
 Haskell implementations.  The latter is probably faster but
 probabilistic (i.e. it will occasionally fail to detect a
 misspelling - which you can of course check against a real
 dictionary).

Using Bryan O'Sullivan's fantastic BloomFilter I got it down below
Python's run time! Now it is 35.56s, 28% of the time is spent on GC,
which I think means there is still some room for improvement.

Do you say that PerfectHash runs with a penalty of calling into c
library?
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Re: Optimizing spelling correction program

2009-06-22 Thread Kamil Dworakowski


On Jun 22, 9:06 pm, Daniel Fischer daniel.is.fisc...@web.de wrote:
 Am Montag 22 Juni 2009 21:31:50 schrieb Kamil Dworakowski:



  On Jun 22, 6:46 am, Bulat Ziganshin bulat.zigans...@gmail.com wrote:
   Hello Kamil,

   Monday, June 22, 2009, 12:01:40 AM, you wrote:
Right... Python uses hashtables while here I have a tree with log n

   you can try this pure hashtable approach:

   import Prelude hiding (lookup)
   import qualified Data.HashTable
   import Data.Array
   import qualified Data.List as List

   data HT a b = HT (a-Int) (Array Int [(a,b)])

   -- size is the size of array (we implent closed hash)
   -- hash is the hash function (a-Int)
   -- list is assoclist of items to put in hash
   create size hash list = HT hashfunc
                              (accumArray (flip (:))
                                          []
                                          (0, arrsize-1)
                                          (map (\(a,b) - (hashfunc a,b))

 Typo: should be

 map (\(a,b) - (hashfunc a, (a,b))

Wait! Have you typed that definition into the msg off the top of your
head? :)

I went back to using Strings instead of ByteStrings and with that
hashtable the program finishes in 31.5s! w00t!
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Optimizing spelling correction program

2009-06-21 Thread Kamil Dworakowski
Hi all,

I want to learn how to optimize Haskell code on the Norvig's spelling
correction program [1]. Peter Norvig has written the program in
Python,
quite a literate translation to Haskell (thanks to Grzegorz Chrupala)
[2] was also available.

Both versions are about 23 LOCs, and Haskell version is four times
slower, 2m25s versus 36s. All the numbers I give come from running the
program on a list of 806 misspellings, Haskell version compiled with -
O2
and -fforce-recomp flags.

I started with trial and error approach. Just preventing a repetitive
call to keysSet brought the running time down to 1m48s. I also
restructured the edits1 function and dropped all uses of sets, which
haven't been pulling their own weight. This brought the running time
down to 55s. Not quite there yet but close.

Reading the Profiling chapter in the RWH book enabled me to do some
more
informed optimizing. At this point the GC time was about 2% of the
overall runnig time, Grzegorz has done a good job of using strict
versions of functions where necessary. The timing per expression
however
showed something useful, 70% of the time was being spent around
determining membership in the known words dictionary (Data.Map). There
are about 30 000 items in the dictionary.

Right... Python uses hashtables while here I have a tree with log n
access time. I did not want to use the Data.HashTable, it would
pervade
my program with IO. The alternative is an ideal hashmap that never
gets
changed. This program creates a dictionary at start which then is only
being used to read from: an ideal application for the Data.PerfectHash
by Mark Wotton available on Hackage [3].

The PerfectHash is a dictionary keyed by ByteStrings, so I had to
migrate the program to use these instead of Strings. Sadly it
increased
the running time to 59s. Adding PerfectHash brought the running time
down to 39s. I have done some code restructuring with the view to
publishing it, which brought the running time up to 45s. The code is
available on patch-tag [4] for anyone to scrutinize and hopefully
point
out mistakes (it uses big.txt and misspellings file which are linked
to
from Norvig's page [1]).

At this point [5] I don't know how to proceed. The program is still
slower than Python. The -ddump-simpl dump does not give me any clues,
there is too much of it and I don't understand most of it. The GC time
is about 10% of the total, and according to the profiler almost half
the
time is spent in the member function, and almost all the rest in
edits1.
Shouldn't it run much faster than the interpreted Python?

Frankly, I am not impressed with the PerfectHash performance; It looks
as if it is only two times faster than Data.Map. Is this because of
the
number of elements?

1. http://www.norvig.com/spell-correct.html
2. 
http://pitekus.blogspot.com/2007/04/norvigs-spelling-corrector-in-haskell.html
3. http://hackage.haskell.org/package/PerfectHash
4. http://patch-tag.com/r/spellcorrect/
5.
http://patch-tag.com/r/spellcorrect/snapshot/hash/20090621192727-a99e5-fed48a7ccf07b79c572fddb0304648fe80d39904/content/pretty/SpellingCorrection.hs
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Re: Optimizing spelling correction program

2009-06-21 Thread Kamil Dworakowski
 What are the keys in your Map? Strict ByteStrings?

 And you're using ghc 6.10.x?

For the record, I use 6.10.1 and strict ByteString everywhere now. I
used to have some lazy IO with vanilla strings, but switching to
Data.ByteString.Char8.readFile didn't change the time at all. The
big.txt is about 6megs. Profiling output says there are about 100 000
000 entries for the lookup function.

http://patch-tag.com/r/spellcorrect/snapshot/current/content/pretty
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe