Re: [Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?
On Mon, 14 May 2007, Malcolm Wallace wrote: Henning Thielemann [EMAIL PROTECTED] wrote: *Text.ParserCombinators.PolyLazy runParser (exactly 4 (satisfy Char.isAlpha)) (abc104++undefined) (*** Exception: Parse.satisfy: failed How can I rewrite the above example that it returns (abc*** Exception: Parse.satisfy: failed The problem in your example is that the 'exactly' combinator forces parsing of 'n' items before returning any of them. If you roll your own, then you can return partial results: let one = return (:) `apply` satisfy (Char.isAlpha) in runParser (one `apply` (one `apply` (one `apply` (one `apply` return [] (abc104++undefined) (abc*** Exception: Parse.satisfy: failed Equivalently: let one f = ((return (:)) `apply` satisfy (Char.isAlpha)) `apply` f in runParser (one (one (one (one (return []) (abc104++undefined) (abc*** Exception: Parse.satisfy: failed I wonder whether 'apply' merges two separate ideas: Applying a generated function to some parser generated value and forcing some parser to always succeed. From the documentation of 'apply' I assumed that 'apply f x' fails if 'f' or 'x' fails. In contrast to that it seems to succeed if only 'f' succeeds. Wouldn't it be better to have an explicit 'force' which declares a parser to never fail - and to return 'undefined' if this assumption is wrong. I have seen this 'force' in the MIDI loader of Haskore: http://darcs.haskell.org/haskore/src/Haskore/General/Parser.hs It would hold: apply f x == do g - f; fmap g (force x) ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?
On Fri, 11 May 2007, Malcolm Wallace wrote: *Text.ParserCombinators.PolyLazy runParser (exactly 4 (satisfy Char.isAlpha)) (abc104++undefined) (*** Exception: Parse.satisfy: failed This output is exactly correct. You asked for the first four characters provided that they were alphabetic, but in fact only the first three were alphabetic. Hence, 'satisfy' failed and threw an exception. If you ask for only the first three characters, then the parse succeeds: The problem is obviously that a later wrong character can make the whole parse fail. Thus successful generated data is not returned until the whole input is parsed and checked. How can I suppress checking the whole input? How can I tell the parser that everything it parsed so far will not be invalidated by further input? How can I rewrite the above example that it returns (abc*** Exception: Parse.satisfy: failed ? I wondered whether 'commit' helps, but it didn't. (I thought it would convert a global 'fail' to a local 'error'.) *Text.ParserCombinators.PolyLazy runParser (exactly 4 (commit (satisfy Char.isAlpha))) (abc104++undefined) *** Exception: Parse.satisfy: failed ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?
Henning Thielemann [EMAIL PROTECTED] wrote: *Text.ParserCombinators.PolyLazy runParser (exactly 4 (satisfy Char.isAlpha)) (abc104++undefined) (*** Exception: Parse.satisfy: failed How can I rewrite the above example that it returns (abc*** Exception: Parse.satisfy: failed The problem in your example is that the 'exactly' combinator forces parsing of 'n' items before returning any of them. If you roll your own, then you can return partial results: let one = return (:) `apply` satisfy (Char.isAlpha) in runParser (one `apply` (one `apply` (one `apply` (one `apply` return [] (abc104++undefined) (abc*** Exception: Parse.satisfy: failed Equivalently: let one f = ((return (:)) `apply` satisfy (Char.isAlpha)) `apply` f in runParser (one (one (one (one (return []) (abc104++undefined) (abc*** Exception: Parse.satisfy: failed Perhaps I should just rewrite the 'exactly' combinator to have the behaviour you desire? Its current definition is: exactly 0 p = return [] exactly n p = do x - p xs - exactly (n-1) p return (x:xs) and a lazier definition would be: exactly 0 p = return [] exactly n p = return (:) `apply` p `apply` exactly (n-1) p How can I tell the parser that everything it parsed so far will not be invalidated by further input? Essentially, you need to return a constructor as soon as you know that the initial portion of parsed data is correct. Often the only sensible way to do that is to use the 'apply' combinator (as shown in the examples above), returning a constructor _function_ which is lazily applied to the remainder of the parsing task. I wondered whether 'commit' helps, but it didn't. (I thought it would convert a global 'fail' to a local 'error'.) The 'commit' combinator is intended for early abortion of a parse attempt that it is known can no longer succeed. That's the opposite of what you want. By contrast, the 'apply' combinator causes a parse attempt to succeed early, even though it may turn out to fail later. Regards, Malcolm ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?
On Mon, 14 May 2007, Malcolm Wallace wrote: Perhaps I should just rewrite the 'exactly' combinator to have the behaviour you desire? Its current definition is: exactly 0 p = return [] exactly n p = do x - p xs - exactly (n-1) p return (x:xs) Is there a difference between 'exactly' and 'replicateM' ? ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?
On Mon, 14 May 2007, Malcolm Wallace wrote: Essentially, you need to return a constructor as soon as you know that the initial portion of parsed data is correct. Often the only sensible way to do that is to use the 'apply' combinator (as shown in the examples above), returning a constructor _function_ which is lazily applied to the remainder of the parsing task. Great, 'apply' is the solution! I admit that I couldn't derive its power from its documentation which simply states Apply a parsed function to a parsed value. :-) ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?
Henning Thielemann [EMAIL PROTECTED] wrote: exactly 0 p = return [] exactly n p = do x - p xs - exactly (n-1) p return (x:xs) Is there a difference between 'exactly' and 'replicateM' ? With this definition, clearly not. But when rewritten to use lazy application, there is certainly a pragmatic difference in where the bottoms (if any) are located in the result. Regards, Malcolm ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?
I want to parse and process HTML lazily. I use HXT because the HTML parser is very liberal. However it uses Parsec and is thus strict. HaXML has a so called lazy parser, but it is not what I consider lazy: *Text.XML.HaXml.Html.ParseLazy Text.XML.HaXml.Pretty.document $ htmlParse text $ htmlhead/headbody++undefined++/body/html *** Exception: Prelude.undefined *Text.XML.HaXml.Html.ParseLazy Text.XML.HaXml.Pretty.document $ htmlParse text $ htmlhead/headbody/body/html *** Exception: Expected / but found at file text at line 1 col 26 If it would be lazy, it would return some HTML code before the error. HaXML uses the Polyparse package for parsing which contains a so called lazy parser. However it has return type (Either String a). That is, for the decision whether the parse was successful, the document has to be parsed completely. *Text.ParserCombinators.PolyLazy runParser (exactly 4 (satisfy Char.isAlpha)) (abc104++undefined) (*** Exception: Parse.satisfy: failed If it would have return type (String, a) it could return both a partial value of type 'a' and the error message as String. It would be even better if it has some handling for incorrect input texts, and returns ([String], a), where [String] is the type of a list of warnings and error messages and 'a' is the type of a total value of parse output. Is there some parser of this type? Unfortunately http://www.haskell.org/haskellwiki/Applications_and_libraries/Compiler_tools does not compare the laziness of the mentioned parsers. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?
On Fri, 11 May 2007, Neil Mitchell wrote: Depending on exactly what you want, TagSoup may be of interest to you. It is lazy, but it doesn't return a tree. It is very tollerant of errors, and will simply never fail to parse something. http://www-users.cs.york.ac.uk/~ndm/tagsoup/ That's an interesting option. It could be used as a lexer for a full-blown HTML parser. Sometimes I need the tree structure. But why does this simple piece of code needs -fglasgow-exts? Thanks for the package and the hint! Henning ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?
Hi That's an interesting option. It could be used as a lexer for a full-blown HTML parser. Sometimes I need the tree structure. But why does this simple piece of code needs -fglasgow-exts? It doesn't. The released version 0.1 doesn't require extensions, and the next 0.2 won't either. In the meantime I accepted a patch from a user that added a new feature and required the flag. I'm going to rework it shortly, make a few tweaks, and remove that flag. Thanks Neil ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?
Henning Thielemann wrote: I want to parse and process HTML lazily. I use HXT because the HTML parser is very liberal. However it uses Parsec and is thus strict. HaXML has a so called lazy parser, but it is not what I consider lazy: *Text.XML.HaXml.Html.ParseLazy Text.XML.HaXml.Pretty.document $ htmlParse text $ htmlhead/headbody++undefined++/body/html *** Exception: Prelude.undefined *Text.XML.HaXml.Html.ParseLazy Text.XML.HaXml.Pretty.document $ htmlParse text $ htmlhead/headbody/body/html *** Exception: Expected / but found at file text at line 1 col 26 If it would be lazy, it would return some HTML code before the error. Are you sure that it is the parser, that is not lazy, and it isn't that the pretty printer is overly strict? From the evidence above the parser could be returning some results before the error, and the pretty printer strictly slurping it all up to the error and then dying. Jules ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?
On Fri, 11 May 2007, Jules Bean wrote: Henning Thielemann wrote: I want to parse and process HTML lazily. I use HXT because the HTML parser is very liberal. However it uses Parsec and is thus strict. HaXML has a so called lazy parser, but it is not what I consider lazy: *Text.XML.HaXml.Html.ParseLazy Text.XML.HaXml.Pretty.document $ htmlParse text $ htmlhead/headbody++undefined++/body/html *** Exception: Prelude.undefined *Text.XML.HaXml.Html.ParseLazy Text.XML.HaXml.Pretty.document $ htmlParse text $ htmlhead/headbody/body/html *** Exception: Expected / but found at file text at line 1 col 26 If it would be lazy, it would return some HTML code before the error. Are you sure that it is the parser, that is not lazy, and it isn't that the pretty printer is overly strict? From the evidence above the parser could be returning some results before the error, and the pretty printer strictly slurping it all up to the error and then dying. I know, but the type of the Polyparse parser prohibits lazy parsing. Unfortunately there is no Show instance for HaXML trees, so one cannot easily see whether laziness gets lost in the parser or in the pretty printer. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?
Henning Thielemann [EMAIL PROTECTED] wrote: HaXml has a so called lazy parser, but it is not what I consider lazy: Lazy parsing is rather subtle, and it is easy to write a too-strict parser when one intended to be more lazy. Equally, it can be easy to imagine that the parser is too strict, when in fact it is the usage context that is wrong. You have indeed found some bugs in HaXml's lazy HTML parser, but you have also partly misunderstood what lazy parsing means. Text.XML.HaXml.Pretty.document $ htmlParse text $ htmlhead/headbody++undefined++/body/html *** Exception: Prelude.undefined The problem here is not that the parser is too strict, but that the pretty-printer is. The pretty-printer is demanding an undefined portion of the value before it produces any output. If it would be lazy, it would return some HTML code before the error. It can do, but only if you consume the part without errors first! For instance, it would be safe to extract just the head tag, because that is complete: import Text.XML.HaXml import Text.XML.HaXml.Posn import Text.XML.HaXml.Html.ParseLazy as Lazy import Text.XML.HaXml.Pretty as PP import Text.PrettyPrint.HughesPJ main = putStrLn $ render $ fsep PP.content $ -- the following line extracts just the first child tag of html (\(Document _ _ e _)- (position 0 children) (CElem e nopos)) $ Lazy.htmlParse text $ htmlhead/headbody++undefined++/body/html Unfortunately, this program currently does throw an undefined exception, even though it should not. The lazy HTML parser contains a couple of tricky corners that _probably_ stop it from being lazy. (1) The element parser does not immediately return an element after seeing its start tag, because it also has to return a stack of improperly terminated elements inside this one (so they can be repaired). (2) After parsing, we simplify the tree structure, which of course traverses it, and may again force too much evaluation. In any case, I will need to investigate further, and hopefully soon push a patch to fix the problem. HaXML uses the Polyparse package for parsing which contains a so called lazy parser. However it has return type (Either String a). That is, for the decision whether the parse was successful, the document has to be parsed completely. Not true. PolyLazy.runParser has the signature runParser :: Parser t a - [t] - (a, [t]) that is, it returns the partially parsed value (which may contain bottoms), and the remaining unparsed token-stream. (Examining either of these return values may cause sufficient evaluation to be forced to lead to a runtime exception.) There is no 'Either' type at the user level. (Although an Either is used internally, see below, it does not do what you think). *Text.ParserCombinators.PolyLazy runParser (exactly 4 (satisfy Char.isAlpha)) (abc104++undefined) (*** Exception: Parse.satisfy: failed This output is exactly correct. You asked for the first four characters provided that they were alphabetic, but in fact only the first three were alphabetic. Hence, 'satisfy' failed and threw an exception. If you ask for only the first three characters, then the parse succeeds: fst $ runParser (exactly 3 (satisfy Char.isAlpha)) (abc104++undefined) abc The purpose of the internal Either type that you mentioned, is to permit backtracking within the parse, not to force complete evaluation. Thus, you can equally ask for the first four characters provided they are alphanumeric, where alphanumeric is decided by a combination of alternate parsers: fst $ runParser (exactly 4 (satisfy Char.isAlpha `onFail` satisfy Char.isDigit)) (abc104++undefined) abc1 This example illustrates that a parse failure is still recoverable when parsing lazily (but only by another parser, not once the failure has escaped the parsing world). Regards, Malcolm ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe