It should be possible to implement Huffman encoding in such a
manner that only the Huffman tree is permanently in memory.

But you need to read the input file twice: you cannot encode the first
character of your input until the entire tree has been constructed,
and to construct the tree you need the character frequencies, so you
have to scan the entire input before you can start encoding. But as
you noticed, that only buys you about 1 MB, so that must be roughly
the size of the in-memory representation of the input list.
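
Just to illustrate the two-pass structure, something along the
following lines keeps only a 256-entry table alive during the first
pass (the file name and the use of an unboxed array from
Data.Array.Unboxed are only for the sake of the sketch, not your
program):

    import Data.Array.Unboxed (UArray, accumArray, assocs)

    -- Pass 1: a strict frequency count, assuming 8-bit input.  Because the
    -- array is unboxed, every count is evaluated as it is accumulated, so
    -- no chain of (+) thunks builds up, and the already-consumed part of
    -- the input list can be garbage-collected as the file streams past.
    countFreqs :: String -> UArray Char Int
    countFreqs cs = accumArray (+) 0 ('\0', '\255') [ (c, 1) | c <- cs ]

    main :: IO ()
    main = do
      s <- readFile "input.txt"                      -- first pass
      print [ p | p@(_, n) <- assocs (countFreqs s), n > 0 ]
      -- a second readFile "input.txt" would then feed the encoder, with
      -- only the (tiny) Huffman tree retained from this first pass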

Therefore, your space eater must be one of
1. the freq_arry or its construction
2. the Huffman tree itself (unlikely: it has only 256 nodes and each
   entry holds an Int)
3. some problem with the construction of the output list, where you
   can lose quite a lot, e.g. do you produce any output before the
   entire input has been encoded, and how do you implement the
   necessary bit-fiddling? (A sketch of a lazy approach follows this
   list.)
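
As to point 3, the bit-fiddling can be done lazily, so that bytes
leave the program long before the input has been read completely. A
rough sketch (packBits, encodeWith and the toy fixed-width code in
main are only illustrative names, not your program):

    import Data.Bits (testBit)
    import Data.Char (chr, ord)
    import System.IO (hSetBinaryMode, stdout)

    -- Pack a lazy list of bits into bytes, eight at a time, most
    -- significant bit first; the final byte is padded with zero bits.
    -- Each output byte depends only on the next eight bits, so bytes can
    -- be written out long before the whole input has been encoded.
    packBits :: [Bool] -> String
    packBits [] = []
    packBits bs = chr (toByte (take 8 (byte ++ repeat False))) : packBits rest
      where
        (byte, rest) = splitAt 8 bs
        toByte       = foldl (\acc b -> 2 * acc + fromEnum b) 0

    -- Drive it with a code table kept as a function (a stand-in for the
    -- lookup in your Huffman tree); concatMap keeps the pipeline lazy.
    encodeWith :: (Char -> [Bool]) -> String -> String
    encodeWith code = packBits . concatMap code

    main :: IO ()
    main = do
      hSetBinaryMode stdout True
      s <- getContents
      -- a toy fixed-width "code" (8 bits per character) just so the
      -- sketch runs; substitute the codes read off your Huffman tree
      putStr (encodeWith (\c -> [ testBit (ord c) i | i <- [7,6..0] ]) s)
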
You can find out which of these it is by inserting cost centres (if
you use ghc) at the suspect places. In the documentation you will also
find hints on improving the space/time behavior of your programs,
e.g. are any of your functions overloaded?
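
With a current ghc the annotation is written {-# SCC "..." #-}; a
small self-contained example of where such a centre might go (the
cost-centre label and the file names are made up):

    import Data.Array.Unboxed (UArray, accumArray)

    -- The count from above, now wrapped in a hand-placed cost centre so
    -- the profile attributes its allocation separately.  Compile with
    --   ghc -prof Prof.hs   (add -fprof-auto, or -auto-all with older
    --   ghcs, to annotate every top-level function automatically)
    -- and run with
    --   ./Prof +RTS -p -hc
    -- to get time and heap profiles broken down by cost centre.
    countFreqs :: String -> UArray Char Int
    countFreqs cs =
      {-# SCC "freq_table" #-}
      accumArray (+) 0 ('\0', '\255') [ (c, 1) | c <- cs ]

    main :: IO ()
    main = do
      s <- readFile "input.txt"
      print (countFreqs s)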

Finally, I would like to mention that the LZW algorithm is somewhat
better suited to implementation in Haskell, since it produces output
immediately. I have recently written a version of it which runs (well:
walks) in under 200k of heap to encode a 53k file. It makes essential
use of ghc's imperative features (mutable arrays), however. There is
also a version in standard Haskell (originally due to Sanders and
Runciman and tuned by Will Partain) in the nofib benchmark suite,
which performs remarkably well.
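
For illustration only, here is a toy purely functional LZW encoder
over Data.Map. It is neither my mutable-array version nor the nofib
one, and the phrase strings make it slow, but it shows how each code
can be emitted as soon as its phrase ends:

    import qualified Data.Map as M

    -- A toy LZW encoder, assuming 8-bit input.  The code for a phrase is
    -- emitted as soon as the phrase ends, so output streams out while the
    -- input is still being read; codes are left as Ints, packing them is
    -- the usual bit-fiddling.
    lzw :: String -> [Int]
    lzw []     = []
    lzw (c:cs) = go dict0 256 [c] cs
      where
        dict0 = M.fromList [ ([d], fromEnum d) | d <- ['\0' .. '\255'] ]
        -- w is the current phrase; it is always present in the dictionary
        go dict _    w []     = [dict M.! w]
        go dict next w (x:xs)
          | M.member w' dict = go dict next w' xs
          | otherwise        = dict M.! w
                             : go (M.insert w' next dict) (next + 1) [x] xs
          where w' = w ++ [x]

    main :: IO ()
    main = getContents >>= print . lzw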

I wasn't going to release my code, but if you are interested you can
find it (including a comparison with some other implementations)
through my web page at URL

http://www-pu.informatik.uni-tuebingen.de/~thiemann/haskell

-Peter

[mentioning only ghc just means I haven't used another Haskell 
 compiler recently.]
