Re: [Haskell-cafe] ansi2html - one program, several issues

2008-07-24 Thread John Meacham
On Sun, Jul 20, 2008 at 09:55:15AM -0400, Isaac Dupree wrote:
 It doesn't stop it from parsing the entire file strictly. However, what 
 it does do is prevent the parser from backtracking out of arbitrary 
 amounts of lookahead. So, unless you use try (which allows for 
 lookahead), when any token is consumed by the parser, it can be garbage 
 collected (assuming the parser is the only thing pointing to the token 
 stream). So, it consumes its input strictly, but with limited overhead 
 (ideally using try only for some small bounded lookahead where it's 
 needed).

 So with Parsec, you can keep the *input* from filling up memory, but if  
 you do, the *result* will still take up space (e.g. Right (value)).  For  
 a simple transformation where the output is a similar string to the  
 input, it will be just as large, so not much space is actually saved  
 (maybe a factor of 2 -- just keeping the output, not also the input), it  
 seems.

Yeah, this is my understanding. frisby combats this via 'irrefutable'
parser combinators. An irrefutable combinator is one that always
succeeds, a prime example is the 'many' combinator. Since 'many'
consumes only as many of its arguments as it can and is perfectly fine
consuming nothing, it inherently always succeeds so the parser can
immediately begin returning results (before consuming all of the input).
Ironically, this means frisby often uses less space than other parsers,
despite being based on PEGs which generally are known for taking a lot
of space.

It is not too hard to ensure your optimizer is irrefutable, for
instance, the parser for a simple language might be

 many statement  eof

however, the 'eof' makes the parser non-irrefutabel. however it is easy
to gain back by doing

 many statement  (eof // pure (error unexpected data))

frisbys static analysis realizes that (irrefutable // ... ) and ( ... //
irrefutable) are irrefutable. 

John


-- 
John Meacham - ⑆repetae.net⑆john⑈
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] ansi2html - one program, several issues

2008-07-20 Thread John Meacham
On Sun, Jul 20, 2008 at 02:34:09AM +0400, Bulat Ziganshin wrote:
 i think that Parsec library should hold entire file in memory only when
 you use 'try' for whole file. otherwise it should omit data as
 proceeded

I do not believe that is the case, since the return type of runParser
Either ParseError a means that before you can extract the result of
the parse from the 'Right' branch, it must evaluate whether the result
is 'Left' or 'Right' meaning it needs to parse the whole input in order
to determine whether the parse was succesful.

This was the reason I made frisby's main parsing routine just be
(roughly)

 runPeg :: P a - String - a

so you have to do something explicit like

 runPegMaybe :: P a - String - Maybe a
 runPegMaybe p s = runPeg (fmap Just p // return Nothing) s

to force strictness in the parsing. 

Though, perhaps parsec is doing something more clever. I do know it uses
the one token lookahead trick to determine which branch to take on
alternation, but I don't think that solves the issue with parsing the
entire file..

John


-- 
John Meacham - ⑆repetae.net⑆john⑈
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] ansi2html - one program, several issues

2008-07-20 Thread Dan Doel
On Sunday 20 July 2008, John Meacham wrote:
 I do not believe that is the case, since the return type of runParser
 Either ParseError a means that before you can extract the result of
 the parse from the 'Right' branch, it must evaluate whether the result
 is 'Left' or 'Right' meaning it needs to parse the whole input in order
 to determine whether the parse was succesful.

 This was the reason I made frisby's main parsing routine just be
 (roughly)

  runPeg :: P a - String - a

 so you have to do something explicit like

  runPegMaybe :: P a - String - Maybe a
  runPegMaybe p s = runPeg (fmap Just p // return Nothing) s

 to force strictness in the parsing.

 Though, perhaps parsec is doing something more clever. I do know it uses
 the one token lookahead trick to determine which branch to take on
 alternation, but I don't think that solves the issue with parsing the
 entire file..

It doesn't stop it from parsing the entire file strictly. However, what it 
does do is prevent the parser from backtracking out of arbitrary amounts of 
lookahead. So, unless you use try (which allows for lookahead), when any 
token is consumed by the parser, it can be garbage collected (assuming the 
parser is the only thing pointing to the token stream). So, it consumes its 
input strictly, but with limited overhead (ideally using try only for some 
small bounded lookahead where it's needed).

By contrast, a naive parser combinator of the form:

p = foo | bar -- or p = try foo | bar in parsec

Might read the entire file into memory parsing foo, without any of it being 
garbage collected until completion, in case foo fails and a backtrack to bar 
is required.

Of course, this all assumes that the input to the parser can both be lazily 
generated, and discarded in pieces (so, not the case if reading an entire 
file into a strict byte string, for instance).

-- Dan
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: Re[2]: [Haskell-cafe] ansi2html - one program, several issues

2008-07-20 Thread Krzysztof Skrzętnicki
On Sun, Jul 20, 2008 at 7:25 AM, Chaddaï Fouché [EMAIL PROTECTED]
wrote:


  That's exactly what I thought. But even if I remove the only 'try' I use
 the
  memory consumption remains unchanged:

 It's true, but in your case your output is almost the raw input data,
 which means that even without a noxious try, you still have the
 whole file in memory. Well hopefully not with your latest code, which
 I would really like to see.


Here is the part that actually changed:

---
split c str = let (p,ps) = aux str in (p:ps)
where
  aux [] = ([],[])
  aux (x:cs) = let (xs,xss) = aux cs in
   if x == c then ([c],(xs:xss)) else ((x:xs),xss)

splitPred :: (Eq a) = (a - Bool) - [a] - [[a]]
splitPred pr str = let (p,ps) = aux str in (p:ps)
where
  aux [] = ([],[])
  aux (x:cs) = let (xs,xss) = aux cs in
   if pr x then ([],((x:xs):xss)) else ((x:xs),xss)

doOneFile :: String - IO ()
doOneFile fname = do
  t1 - getCurrentTime
  doesFileExist (fname ++ .html) = \b - if b then hPutStrLn stderr $
printf File already processed, skipping: %s fname else do
src - readFile fname
out - openFile (fname ++ .html) WriteMode
hSetBuffering out (BlockBuffering (Just 64000))
hPutStrLn out html
hPutStrLn out body bgcolor=\black\
hPutStrLn out meta http-equiv=\Content-Type\ content=\text/html;
charset=UTF-8\
hPutStrLn out span style=\font-family: monospace; font-size: 13;\
span
let extractData = \p - case p of
  Right x - x
  Left err - (trace . show $ err) []
let srcSplit = splitPred (`elem`\n) src
let parsed = concatMap (extractData . parse mainParser fname) srcSplit
execStateT (hPrintHtml (St id)) (out,emptyStyle) -- wypisujemy pierwszy
wiersz
execStateT (mapM_ hPrintHtml parsed) (out,emptyStyle)
hPutStrLn out /span/span
hPutStrLn out /body
hPutStrLn out /html
t2 - getCurrentTime
hPutStrLn stderr $ printf File %s processed. It took %s. File size was
%d characters. fname (show $ diffUTCTime t2 t1) (length src)
hClose out
--

The whole file is also attached. You will find there another (worse)
implementation of split and a little bit of code similar to thread pool
stuff.

On Sun, Jul 20, 2008 at 8:17 AM, John Meacham [EMAIL PROTECTED] wrote:

 On Sun, Jul 20, 2008 at 02:34:09AM +0400, Bulat Ziganshin wrote:
  i think that Parsec library should hold entire file in memory only when
  you use 'try' for whole file. otherwise it should omit data as
  proceeded

 I do not believe that is the case, since the return type of runParser
 Either ParseError a means that before you can extract the result of
 the parse from the 'Right' branch, it must evaluate whether the result
 is 'Left' or 'Right' meaning it needs to parse the whole input in order
 to determine whether the parse was succesful.


It's true it has to parse the whole file, but it is not true it has to
reside in the memory: only the results must be there. In this case, when the
result is 1-1 transformation of input, it is true. But consider this
program:

module Main where
import Text.ParserCombinators.Parsec

par = eof | (char 'a'  par)

alst = take 2 (repeat 'a')

main = print (runParser par ()  alst)

It runs in constant memory:

$ ./partest.exe +RTS -sstderr
C:\cygwin\home\Metharius\killer\killerPy\ansi2html\partest.exe +RTS -sstderr
Right ()
  84,326,845,636 bytes allocated in the heap
  22,428,536 bytes copied during GC
   9,684 bytes maximum residency (1 sample(s))
  13,848 bytes maximum slop
   1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 160845 collections, 0 parallel,  0.63s,  0.63s elapsed
  Generation 1: 1 collections, 0 parallel,  0.00s,  0.00s elapsed

  INIT  time0.02s  (  0.00s elapsed)
  MUT   time   54.31s  ( 54.55s elapsed)
  GCtime0.63s  (  0.63s elapsed)
  EXIT  time0.00s  (  0.00s elapsed)
  Total time   54.95s  ( 55.17s elapsed)

  %GC time   1.1%  (1.1% elapsed)

  Alloc rate1,552,176,623 bytes per MUT second

  Productivity  98.8% of total user, 98.4% of total elapsed


Best regards
Christopher Skrzętnicki


ansi2html.hs
Description: Binary data
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] ansi2html - one program, several issues

2008-07-20 Thread Isaac Dupree

Dan Doel wrote:

On Sunday 20 July 2008, John Meacham wrote:

I do not believe that is the case, since the return type of runParser
Either ParseError a means that before you can extract the result of
the parse from the 'Right' branch, it must evaluate whether the result
is 'Left' or 'Right' meaning it needs to parse the whole input in order
to determine whether the parse was succesful.


...

It doesn't stop it from parsing the entire file strictly. However, what it 
does do is prevent the parser from backtracking out of arbitrary amounts of 
lookahead. So, unless you use try (which allows for lookahead), when any 
token is consumed by the parser, it can be garbage collected (assuming the 
parser is the only thing pointing to the token stream). So, it consumes its 
input strictly, but with limited overhead (ideally using try only for some 
small bounded lookahead where it's needed).


So with Parsec, you can keep the *input* from filling up memory, but if 
you do, the *result* will still take up space (e.g. Right (value)).  For 
a simple transformation where the output is a similar string to the 
input, it will be just as large, so not much space is actually saved 
(maybe a factor of 2 -- just keeping the output, not also the input), it 
seems.


-Isaac
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: Re[2]: [Haskell-cafe] ansi2html - one program, several issues

2008-07-20 Thread Krzysztof Skrzętnicki
I played with another approach without any parser library, just with plain
pattern matching. The idea was to create function to match all different
cases of codes. Since I already got most of the code, it was quite easy to
do. The core function consist of cases like those:

  parse ('\ESC':'[':'1':';':'4':'0':'m':rest) = modifyAndPrint (\x - x
{ bgcol = light black }) parse rest
  parse ('\ESC':'[':'1':';':'4':'1':'m':rest) = modifyAndPrint (\x - x
{ bgcol = light red }) parse rest
  parse ('\ESC':'[':'1':';':'4':'2':'m':rest) = modifyAndPrint (\x - x
{ bgcol = light green }) parse rest
  parse ('\ESC':'[':'1':';':'4':'3':'m':rest) = modifyAndPrint (\x - x
{ bgcol = light yellow }) parse rest

If you have read the old code you should recognize some parts of it here.
It should consume rather constant amount of memory. To my surprise it
consumed almost exactly the same amount of memory as the previous program.
Turns out the problematic line was this:

hPutStrLn stderr $ printf File %s processed. It took %s. File size was
%d characters. fname (show $ diffUTCTime t2 t1) *(length src)*

It computed length of the input file. Needless to say, because src was
actually the input file parsed previously, it was all hanging in the memory.
Having removed that reference to src both programs (the one that parses
input per line and the most recent one) are running in constant memory
(2Mb). This doesn't apply to the first program, which has to read whole file
before producing any output.

And the last note: the new program is also 2x faster, perhaps due to very
simple structure that is easy to optimize. It also makes sense now to use
mapMPar as it reduces run time by 30%. The full code is in attachments.

Best regards
Christopher Skrzętnicki


ansi2html.hs
Description: Binary data


ansi2html.hs
Description: Binary data
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] ansi2html - one program, several issues

2008-07-19 Thread Chaddaï Fouché
2008/7/19 Krzysztof Skrzętnicki [EMAIL PROTECTED]:
 Hi all

 1) Profiling shows that very simple functions are source of great memory and
 time consumption. However, if I turn them off and simply print their input
 arguments instead, the overall time and memory consumption doesn't change.
 But now another function is acting badly. My guess: somehow the cost of
 Parsec code is shifted into whatever function is using it's output. Let's
 see:

Are you using Parsec to parse the whole file ? Then your problem is
there : Parsec needs to read and process the whole file before it can
give us any output since it thinks it could have to give us an error
instead and it can't be sure of that before he has read the whole
thing...
In your case, your problem is such that you would prefer to treat the
file as a stream, isn't it ?
There are some parser library that can give output lazily (look at
polyparse flavour), another option would be to only use Parsec where
you need it and just read and print the ordinary text for example.

-- 
Jedaï
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] ansi2html - one program, several issues

2008-07-19 Thread Krzysztof Skrzętnicki
I forgot to mention that the memory consumption is several times higher than
file size. On 8,3 Mb file:
 532 MB total memory in use (4 MB lost due to fragmentation).

Having that 8 Mb in memory is not the problem. 532 Mb is another story. In
general, the program consumes roughly 64 times more memory than file size
and it scales linearly.


Best regards
Christopher Skrzętnicki

On Sat, Jul 19, 2008 at 9:52 PM, Chaddaï Fouché [EMAIL PROTECTED]
wrote:

 2008/7/19 Krzysztof Skrzętnicki [EMAIL PROTECTED]:
  Hi all
 
  1) Profiling shows that very simple functions are source of great memory
 and
  time consumption. However, if I turn them off and simply print their
 input
  arguments instead, the overall time and memory consumption doesn't
 change.
  But now another function is acting badly. My guess: somehow the cost of
  Parsec code is shifted into whatever function is using it's output. Let's
  see:

 Are you using Parsec to parse the whole file ? Then your problem is
 there : Parsec needs to read and process the whole file before it can
 give us any output since it thinks it could have to give us an error
 instead and it can't be sure of that before he has read the whole
 thing...
 In your case, your problem is such that you would prefer to treat the
 file as a stream, isn't it ?
 There are some parser library that can give output lazily (look at
 polyparse flavour), another option would be to only use Parsec where
 you need it and just read and print the ordinary text for example.

 --
 Jedaï

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re[2]: [Haskell-cafe] ansi2html - one program, several issues

2008-07-19 Thread Bulat Ziganshin
Hello Krzysztof,

Sunday, July 20, 2008, 12:49:54 AM, you wrote:

on the 32-bit computers 36x memreqs for storing large strings in
memory is a rule, on 64-bit ones - 72x


 I forgot to mention that the memory consumption is several times
 higher than file size. On 8,3 Mb file:
 532 MB total memory in use (4 MB lost due to fragmentation).

 Having that 8 Mb in memory is not the problem. 532 Mb is another
 story. In general, the program consumes roughly 64 times more memory
 than file size and it scales linearly.
  

 Best regards
 Christopher Skrzetnicki

 On Sat, Jul 19, 2008 at 9:52 PM, Chaddai Fouche [EMAIL PROTECTED] wrote:
 2008/7/19 Krzysztof Skrzetnicki  [EMAIL PROTECTED]:
 Hi all 
  

 1) Profiling shows that very simple functions are source of great memory and 
 time consumption. However, if I turn them off and simply print their input 
 arguments instead, the overall time and memory consumption doesn't change. 
 But now another function is acting badly. My guess: somehow the cost of 
 Parsec code is shifted into whatever function is using it's output. Let's 
 see: 
  
  
 Are you using Parsec to parse the whole file ? Then your problem is
 there : Parsec needs to read and process the whole file before it can 
 give us any output since it thinks it could have to give us an error 
 instead and it can't be sure of that before he has read the whole 
 thing... 
 In your case, your problem is such that you would prefer to treat the 
 file as a stream, isn't it ? 
 There are some parser library that can give output lazily (look at 
 polyparse flavour), another option would be to only use Parsec where 
 you need it and just read and print the ordinary text for example. 
  



-- 
Best regards,
 Bulatmailto:[EMAIL PROTECTED]

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] ansi2html - one program, several issues

2008-07-19 Thread Chaddaï Fouché
2008/7/19 Krzysztof Skrzętnicki [EMAIL PROTECTED]:
 I forgot to mention that the memory consumption is several times higher than
 file size. On 8,3 Mb file:
 532 MB total memory in use (4 MB lost due to fragmentation).

 Having that 8 Mb in memory is not the problem. 532 Mb is another story. In
 general, the program consumes roughly 64 times more memory than file size
 and it scales linearly.

You should be using ByteString, though this problem would be
alleviated if you were consuming the file as a stream.

-- 
Jedaï
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] ansi2html - one program, several issues

2008-07-19 Thread Krzysztof Skrzętnicki
On Sat, Jul 19, 2008 at 11:35 PM, Chaddaï Fouché [EMAIL PROTECTED]
wrote:

 2008/7/19 Krzysztof Skrzętnicki [EMAIL PROTECTED]:
  I forgot to mention that the memory consumption is several times higher
 than
  file size. On 8,3 Mb file:
  532 MB total memory in use (4 MB lost due to fragmentation).
 
  Having that 8 Mb in memory is not the problem. 532 Mb is another story.
 In
  general, the program consumes roughly 64 times more memory than file size
  and it scales linearly.

 You should be using ByteString, though this problem would be
 alleviated if you were consuming the file as a stream.


Since ANSI color codes doesn't contain characters like newline or space, I
have simply split input file into such lines. Now the whole program behaves
much better: GC time is below 10% and memory consumption dropped to 74 Mb
per thread. It's still a lot of memory though and it certainly holds much
more than one line of text.

Best regards
Christopher Skrzętnicki
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re[2]: [Haskell-cafe] ansi2html - one program, several issues

2008-07-19 Thread Bulat Ziganshin
Hello Krzysztof,

Sunday, July 20, 2008, 1:55:45 AM, you wrote:
 532 MB total memory in use (4 MB lost due to fragmentation).

i think that Parsec library should hold entire file in memory only when
you use 'try' for whole file. otherwise it should omit data as
proceeded


-- 
Best regards,
 Bulatmailto:[EMAIL PROTECTED]

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: Re[2]: [Haskell-cafe] ansi2html - one program, several issues

2008-07-19 Thread Krzysztof Skrzętnicki
On Sun, Jul 20, 2008 at 12:34 AM, Bulat Ziganshin [EMAIL PROTECTED]
wrote:

 Hello Krzysztof,

 Sunday, July 20, 2008, 1:55:45 AM, you wrote:
  532 MB total memory in use (4 MB lost due to fragmentation).

 i think that Parsec library should hold entire file in memory only when
 you use 'try' for whole file. otherwise it should omit data as
 proceeded


That's exactly what I thought. But even if I remove the only 'try' I use the
memory consumption remains unchanged:

C:\cygwin\home\Metharius\killer\KillerPy\ansi2html\ansi2html_old.exe
duzy.log +RTS -sstderr
File duzy.log processed. It took 5.046875s. File size was 4166578
characters.
   3,950,649,704 bytes allocated in the heap
 535,544,056 bytes copied during GC
 117,603,408 bytes maximum residency (9 sample(s))
   1,647,828 bytes maximum slop
 265 MB total memory in use (2 MB lost due to fragmentation)

  Generation 0:  7527 collections, 0 parallel,  0.86s,  0.86s elapsed
  Generation 1: 9 collections, 0 parallel,  0.80s,  0.81s elapsed

  INIT  time0.02s  (  0.00s elapsed)
  MUT   time3.20s  (  3.63s elapsed)
  GCtime1.66s  (  1.67s elapsed)
  EXIT  time0.00s  (  0.00s elapsed)
  Total time4.88s  (  5.30s elapsed)

  %GC time  34.0%  (31.6% elapsed)

  Alloc rate1,227,386,315 bytes per MUT second

  Productivity  65.7% of total user, 60.5% of total elapsed



One more thing to note: with partial parsing there is no longer a difference
between mapM_ and mapMPar.

Best regards
Christopher Skrzętnicki
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: Re[2]: [Haskell-cafe] ansi2html - one program, several issues

2008-07-19 Thread Chaddaï Fouché
2008/7/20 Krzysztof Skrzętnicki [EMAIL PROTECTED]:
 On Sun, Jul 20, 2008 at 12:34 AM, Bulat Ziganshin
 [EMAIL PROTECTED] wrote:

 Hello Krzysztof,

 Sunday, July 20, 2008, 1:55:45 AM, you wrote:
  532 MB total memory in use (4 MB lost due to fragmentation).

 i think that Parsec library should hold entire file in memory only when
 you use 'try' for whole file. otherwise it should omit data as
 proceeded


 That's exactly what I thought. But even if I remove the only 'try' I use the
 memory consumption remains unchanged:

It's true, but in your case your output is almost the raw input data,
which means that even without a noxious try, you still have the
whole file in memory. Well hopefully not with your latest code, which
I would really like to see.

-- 
Jedaï
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe