Re: [Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?

2007-05-21 Thread Henning Thielemann

On Mon, 14 May 2007, Malcolm Wallace wrote:

 Henning Thielemann [EMAIL PROTECTED] wrote:

*Text.ParserCombinators.PolyLazy
  runParser (exactly 4 (satisfy Char.isAlpha))
  (abc104++undefined)
(*** Exception: Parse.satisfy: failed
 
  How can I rewrite the above example that it returns
(abc*** Exception: Parse.satisfy: failed

 The problem in your example is that the 'exactly' combinator forces
 parsing of 'n' items before returning any of them.  If you roll your
 own, then you can return partial results:

  let one = return (:) `apply` satisfy (Char.isAlpha)
   in runParser (one `apply` (one `apply`
(one `apply` (one `apply` return []
  (abc104++undefined)
 (abc*** Exception: Parse.satisfy: failed

 Equivalently:

  let one f = ((return (:)) `apply` satisfy (Char.isAlpha)) `apply` f
   in runParser (one (one (one (one (return []) (abc104++undefined)
 (abc*** Exception: Parse.satisfy: failed

 I wonder whether 'apply' merges two separate ideas: Applying a generated
function to some parser generated value and forcing some parser to always
succeed. From the documentation of 'apply' I assumed that 'apply f x'
fails if 'f' or 'x' fails. In contrast to that it seems to succeed if only
'f' succeeds. Wouldn't it be better to have an explicit 'force' which
declares a parser to never fail - and to return 'undefined' if this
assumption is wrong.
 I have seen this 'force' in the MIDI loader of Haskore:
  http://darcs.haskell.org/haskore/src/Haskore/General/Parser.hs

 It would hold:
   apply f x  ==  do g - f; fmap g (force x)
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?

2007-05-14 Thread Henning Thielemann

On Fri, 11 May 2007, Malcolm Wallace wrote:

  *Text.ParserCombinators.PolyLazy
runParser (exactly 4 (satisfy Char.isAlpha)) (abc104++undefined)
  (*** Exception: Parse.satisfy: failed

 This output is exactly correct.  You asked for the first four characters
 provided that they were alphabetic, but in fact only the first three
 were alphabetic.  Hence, 'satisfy' failed and threw an exception.  If
 you ask for only the first three characters, then the parse succeeds:

The problem is obviously that a later wrong character can make the whole
parse fail. Thus successful generated data is not returned until the whole
input is parsed and checked. How can I suppress checking the whole input?
How can I tell the parser that everything it parsed so far will not be
invalidated by further input? How can I rewrite the above example that it
returns
  (abc*** Exception: Parse.satisfy: failed
?

I wondered whether 'commit' helps, but it didn't. (I thought it would
convert a global 'fail' to a local 'error'.)

*Text.ParserCombinators.PolyLazy
runParser (exactly 4 (commit (satisfy Char.isAlpha))) (abc104++undefined)
*** Exception: Parse.satisfy: failed
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?

2007-05-14 Thread Malcolm Wallace
Henning Thielemann [EMAIL PROTECTED] wrote:

   *Text.ParserCombinators.PolyLazy
 runParser (exactly 4 (satisfy Char.isAlpha))
 (abc104++undefined)
   (*** Exception: Parse.satisfy: failed
 
 How can I rewrite the above example that it returns
   (abc*** Exception: Parse.satisfy: failed

The problem in your example is that the 'exactly' combinator forces
parsing of 'n' items before returning any of them.  If you roll your
own, then you can return partial results:

 let one = return (:) `apply` satisfy (Char.isAlpha)
  in runParser (one `apply` (one `apply`
   (one `apply` (one `apply` return []
 (abc104++undefined)
(abc*** Exception: Parse.satisfy: failed

Equivalently:

 let one f = ((return (:)) `apply` satisfy (Char.isAlpha)) `apply` f
  in runParser (one (one (one (one (return []) (abc104++undefined)
(abc*** Exception: Parse.satisfy: failed

Perhaps I should just rewrite the 'exactly' combinator to have the
behaviour you desire?  Its current definition is:

exactly 0 p = return []
exactly n p = do x - p
 xs - exactly (n-1) p
 return (x:xs)

and a lazier definition would be:

exactly 0 p = return []
exactly n p = return (:) `apply` p `apply` exactly (n-1) p

 How can I tell the parser that everything it parsed so
 far will not be invalidated by further input?

Essentially, you need to return a constructor as soon as you know that
the initial portion of parsed data is correct.  Often the only sensible
way to do that is to use the 'apply' combinator (as shown in the
examples above), returning a constructor _function_ which is lazily
applied to the remainder of the parsing task.

 I wondered whether 'commit' helps, but it didn't. (I thought it would
 convert a global 'fail' to a local 'error'.)

The 'commit' combinator is intended for early abortion of a parse
attempt that it is known can no longer succeed.  That's the opposite of
what you want.  By contrast, the 'apply' combinator causes a parse
attempt to succeed early, even though it may turn out to fail later.

Regards,
Malcolm
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?

2007-05-14 Thread Henning Thielemann

On Mon, 14 May 2007, Malcolm Wallace wrote:

 Perhaps I should just rewrite the 'exactly' combinator to have the
 behaviour you desire?  Its current definition is:

 exactly 0 p = return []
 exactly n p = do x - p
  xs - exactly (n-1) p
  return (x:xs)

Is there a difference between 'exactly' and 'replicateM' ?
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?

2007-05-14 Thread Henning Thielemann

On Mon, 14 May 2007, Malcolm Wallace wrote:

 Essentially, you need to return a constructor as soon as you know that
 the initial portion of parsed data is correct.  Often the only sensible
 way to do that is to use the 'apply' combinator (as shown in the
 examples above), returning a constructor _function_ which is lazily
 applied to the remainder of the parsing task.

Great, 'apply' is the solution! I admit that I couldn't derive its power
from its documentation which simply states
 Apply a parsed function to a parsed value.

:-)
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?

2007-05-14 Thread Malcolm Wallace
Henning Thielemann [EMAIL PROTECTED] wrote:

  exactly 0 p = return []
  exactly n p = do x - p
   xs - exactly (n-1) p
   return (x:xs)
 
 Is there a difference between 'exactly' and 'replicateM' ?

With this definition, clearly not.  But when rewritten to use lazy
application, there is certainly a pragmatic difference in where the
bottoms (if any) are located in the result.

Regards,
Malcolm
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?

2007-05-11 Thread Henning Thielemann

I want to parse and process HTML lazily. I use HXT because the HTML parser
is very liberal. However it uses Parsec and is thus strict. HaXML has a
so called lazy parser, but it is not what I consider lazy:

*Text.XML.HaXml.Html.ParseLazy Text.XML.HaXml.Pretty.document $ htmlParse 
text $ htmlhead/headbody++undefined++/body/html
*** Exception: Prelude.undefined
*Text.XML.HaXml.Html.ParseLazy Text.XML.HaXml.Pretty.document $ htmlParse 
text $ htmlhead/headbody/body/html
*** Exception: Expected / but found 
  at file text  at line 1 col 26

If it would be lazy, it would return some HTML code before the error.
HaXML uses the Polyparse package for parsing which contains a so called
lazy parser. However it has return type (Either String a). That is, for
the decision whether the parse was successful, the document has to be
parsed completely.

*Text.ParserCombinators.PolyLazy runParser (exactly 4 (satisfy Char.isAlpha)) 
(abc104++undefined)
(*** Exception: Parse.satisfy: failed

If it would have return type (String, a) it could return both a partial
value of type 'a' and the error message as String. It would be even better
if it has some handling for incorrect input texts, and returns ([String],
a), where [String] is the type of a list of warnings and error messages
and 'a' is the type of a total value of parse output.

Is there some parser of this type? Unfortunately
 http://www.haskell.org/haskellwiki/Applications_and_libraries/Compiler_tools
   does not compare the laziness of the mentioned parsers.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?

2007-05-11 Thread Henning Thielemann

On Fri, 11 May 2007, Neil Mitchell wrote:

 Depending on exactly what you want, TagSoup may be of interest to you.
 It is lazy, but it doesn't return a tree. It is very tollerant of
 errors, and will simply never fail to parse something.

 http://www-users.cs.york.ac.uk/~ndm/tagsoup/

That's an interesting option. It could be used as a lexer for a full-blown
HTML parser. Sometimes I need the tree structure. But why does this simple
piece of code needs -fglasgow-exts?


Thanks for the package and the hint!
 Henning
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?

2007-05-11 Thread Neil Mitchell

Hi


That's an interesting option. It could be used as a lexer for a full-blown
HTML parser. Sometimes I need the tree structure. But why does this simple
piece of code needs -fglasgow-exts?


It doesn't. The released version 0.1 doesn't require extensions, and
the next 0.2 won't either. In the meantime I accepted a patch from a
user that added a new feature and required the flag. I'm going to
rework it shortly, make a few tweaks, and remove that flag.

Thanks

Neil
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?

2007-05-11 Thread Jules Bean

Henning Thielemann wrote:

I want to parse and process HTML lazily. I use HXT because the HTML parser
is very liberal. However it uses Parsec and is thus strict. HaXML has a
so called lazy parser, but it is not what I consider lazy:

*Text.XML.HaXml.Html.ParseLazy Text.XML.HaXml.Pretty.document $ htmlParse text $ 
htmlhead/headbody++undefined++/body/html
*** Exception: Prelude.undefined
*Text.XML.HaXml.Html.ParseLazy Text.XML.HaXml.Pretty.document $ htmlParse text $ 
htmlhead/headbody/body/html
*** Exception: Expected / but found 
  at file text  at line 1 col 26

If it would be lazy, it would return some HTML code before the error.
  


Are you sure that it is the parser, that is not lazy, and it isn't that 
the pretty printer is overly strict?


From the evidence above the parser could be returning some results 
before the error, and the pretty printer strictly slurping it all up to 
the error and then dying.


Jules

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?

2007-05-11 Thread Henning Thielemann

On Fri, 11 May 2007, Jules Bean wrote:

 Henning Thielemann wrote:
  I want to parse and process HTML lazily. I use HXT because the HTML parser
  is very liberal. However it uses Parsec and is thus strict. HaXML has a
  so called lazy parser, but it is not what I consider lazy:
 
  *Text.XML.HaXml.Html.ParseLazy Text.XML.HaXml.Pretty.document $ htmlParse 
  text $ htmlhead/headbody++undefined++/body/html
  *** Exception: Prelude.undefined
  *Text.XML.HaXml.Html.ParseLazy Text.XML.HaXml.Pretty.document $ htmlParse 
  text $ htmlhead/headbody/body/html
  *** Exception: Expected / but found 
at file text  at line 1 col 26
 
  If it would be lazy, it would return some HTML code before the error.

 Are you sure that it is the parser, that is not lazy, and it isn't that
 the pretty printer is overly strict?

  From the evidence above the parser could be returning some results
 before the error, and the pretty printer strictly slurping it all up to
 the error and then dying.

I know, but the type of the Polyparse parser prohibits lazy parsing.
Unfortunately there is no Show instance for HaXML trees, so one cannot
easily see whether laziness gets lost in the parser or in the pretty
printer.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?

2007-05-11 Thread Malcolm Wallace
Henning Thielemann [EMAIL PROTECTED] wrote:

 HaXml has a so called lazy parser, but it is not what I consider lazy:

Lazy parsing is rather subtle, and it is easy to write a too-strict
parser when one intended to be more lazy.  Equally, it can be easy to
imagine that the parser is too strict, when in fact it is the usage
context that is wrong.

You have indeed found some bugs in HaXml's lazy HTML parser, but you
have also partly misunderstood what lazy parsing means.

  Text.XML.HaXml.Pretty.document $ htmlParse text $
  htmlhead/headbody++undefined++/body/html
 *** Exception: Prelude.undefined

The problem here is not that the parser is too strict, but that the
pretty-printer is.  The pretty-printer is demanding an undefined portion
of the value before it produces any output.

 If it would be lazy, it would return some HTML code before the error.

It can do, but only if you consume the part without errors first!  For
instance, it would be safe to extract just the head tag, because that
is complete:

import Text.XML.HaXml
import Text.XML.HaXml.Posn
import Text.XML.HaXml.Html.ParseLazy as Lazy
import Text.XML.HaXml.Pretty as PP
import Text.PrettyPrint.HughesPJ

main = putStrLn $ render $ fsep PP.content $
   -- the following line extracts just the first child tag of html
   (\(Document _ _ e _)- (position 0 children) (CElem e nopos)) $
   Lazy.htmlParse text $
   htmlhead/headbody++undefined++/body/html

Unfortunately, this program currently does throw an undefined
exception, even though it should not.  The lazy HTML parser contains a
couple of tricky corners that _probably_ stop it from being lazy.
  (1) The element parser does not immediately return an element after
  seeing its start tag, because it also has to return a stack of
  improperly terminated elements inside this one (so they can be
  repaired).
  (2) After parsing, we simplify the tree structure, which of course
  traverses it, and may again force too much evaluation.
In any case, I will need to investigate further, and hopefully soon push
a patch to fix the problem.

 HaXML uses the Polyparse package for parsing which contains a so
 called lazy parser. However it has return type (Either String a). That
 is, for the decision whether the parse was successful, the document
 has to be parsed completely.

Not true.  PolyLazy.runParser has the signature
runParser :: Parser t a - [t] - (a, [t])
that is, it returns the partially parsed value (which may contain
bottoms), and the remaining unparsed token-stream.  (Examining either of
these return values may cause sufficient evaluation to be forced to lead
to a runtime exception.)  There is no 'Either' type at the user level.
(Although an Either is used internally, see below, it does not do what
you think).

 *Text.ParserCombinators.PolyLazy
   runParser (exactly 4 (satisfy Char.isAlpha)) (abc104++undefined)
 (*** Exception: Parse.satisfy: failed

This output is exactly correct.  You asked for the first four characters
provided that they were alphabetic, but in fact only the first three
were alphabetic.  Hence, 'satisfy' failed and threw an exception.  If
you ask for only the first three characters, then the parse succeeds:

   fst $ runParser (exactly 3 (satisfy Char.isAlpha))
(abc104++undefined)
  abc

The purpose of the internal Either type that you mentioned, is to permit
backtracking within the parse, not to force complete evaluation.  Thus,
you can equally ask for the first four characters provided they are
alphanumeric, where alphanumeric is decided by a combination of
alternate parsers:

   fst $ runParser (exactly 4 (satisfy Char.isAlpha `onFail`
satisfy Char.isDigit))
(abc104++undefined)
  abc1

This example illustrates that a parse failure is still recoverable when
parsing lazily (but only by another parser, not once the failure has
escaped the parsing world).

Regards,
Malcolm
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe