Re: [xml] Parsing tag-soup HTML

Daniel Veillard Mon, 18 Jun 2007 05:14:08 -0700

On Mon, Jun 18, 2007 at 11:07:42AM +0100, Nick Kew wrote:
> On Sun, 17 Jun 2007 11:42:08 -0400
> Daniel Veillard <[EMAIL PROTECTED]> wrote:
> 
> >  Coming back with some kind of definition of what a tag soup parser
> > behaviour is is probably more important than digging in libxml2 code.
> 
> A slightly circular argument in this case.  What I really need to
> do is review the case history of what users complain about, and
> relate that to how the parser works.  Bear in mind this is a 
> streaming SAX parser: other APIs are way too slow and therefore
> of no interest in this context.


  Out of context, but I wonder why you think the reader would be that
much slower. I only tested with XML, but the cost was within 20% of the
SAX parsing speed.

> If I write a new parser from scratch, it'll be a simpleminded thing
> based on what bad tag-soup "html" expects:
> 
>   <foo ...> generates a startElement event
>   </foo> generates an endElement event
>   <!-- generates a start-comment which is terminated by -->
>   <script> and <style> treat their contents as a black-box
>   terminated by </script>/</style> and nothing else.
> 
> With libxml2 we can add value to that by inserting implied closing
> tags.  But in some cases, we need to avoid inserting implied opening
> tags.  And we should dispense with some error corrections such as 
> rejecting an <html> opening tag after a document has opened.
> In fact, I think we need to dispense with generating *any* implied
> opening tags when in tag-soup mode.  Which in turn means we can't
> imply closing tags, lest they be unmatched!
> 
> So in terms of a first-iteration draft wishlist, tag-soup mode should:
>   - avoid inserting any implied tags in a SAX parse

  That would be contrary to what Tag Soup actually means for most people,
as I pointed out.

>   - treat contents of <script></script> and <style></style> as raw
>     CDATA, and don't parse it.

  You need *some* parsing just to detect the end tag, and now you're
back to square one: which of these will you accept as the terminator

    </
    </sc
    </script
    </script>
    </SCRIPT
    </ScRIpT
    </SCRIPT >
 
 ?
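
  To make the question concrete, here is a minimal sketch in Python of
one possible criterion. The rule is an assumption chosen for illustration,
not what libxml2 does: accept "</script" in any letter case, but only when
it is followed by whitespace, "/" or ">". The names SCRIPT_END and
script_end_offset are made up for this example.

```python
import re

# Assumed rule: "</script" in any letter case, followed by whitespace, "/" or ">".
SCRIPT_END = re.compile(r"</script[\t\n\f\r />]", re.IGNORECASE)


def script_end_offset(text):
    """Return the offset of the first accepted terminator in text, or -1."""
    match = SCRIPT_END.search(text)
    return match.start() if match else -1


# The candidates listed above, checked one by one.
candidates = ["</", "</sc", "</script", "</script>",
              "</SCRIPT", "</ScRIpT", "</SCRIPT >"]
for candidate in candidates:
    print(repr(candidate), script_end_offset(candidate) >= 0)
```

  Under that rule only "</script>" and "</SCRIPT >" terminate the block;
the truncated forms do not, because nothing follows the tag name yet. A
streaming parser would have to wait for more input before deciding.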
    
> > which it seems would defeat your first example I guess.
> > The problem really is to try to come back to a set of guarantees and 
> > behavior rules. Reading the slides pointed from the end of that page
> > may help. But I'm not sure it's what you want, but since you use the
> > same name, it should hopefully be close.
> 
> Sounds like he's using "tag soup" to mean something that cleans it up,
> in the tradition of Tidy or AccessValet.  I'm contemplating the exact
> opposite: something that leaves it intact!

  And I think that as an API you just can't! You will break applications
if you deliver
    <em> aaa <b> bbb </em> ccc </b>
 as two opening tags followed by two closing tags in the wrong order.
It seems what you want is purely textual transformation, and in that case a
parser doesn't sound like the best tool for the job. But maybe I misunderstand.
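
  To show the kind of event stream I mean, here is a minimal sketch using
Python's standard html.parser module rather than libxml2. It is used here
only because it reports tags as they appear, without implying any. The
names EventLog and log_events are made up for this example.

```python
from html.parser import HTMLParser


class EventLog(HTMLParser):
    """Record start/end events as they appear and flag unbalanced end tags."""

    def __init__(self):
        super().__init__()
        self.events = []      # (kind, tag) pairs in document order
        self.open_tags = []   # stack of currently open elements
        self.mismatches = []  # end tags that did not close the innermost element

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        self.events.append(("end", tag))
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.mismatches.append(tag)


def log_events(markup):
    """Parse markup and return the parser holding the recorded events."""
    parser = EventLog()
    parser.feed(markup)
    parser.close()
    return parser


result = log_events("<em> aaa <b> bbb </em> ccc </b>")
print(result.events)
print(result.mismatches)
```

  The events come out as start em, start b, end em, end b, so the "end em"
arrives while b is still the innermost open element. Any application that
keeps a stack of open elements, as this one does, has to cope with that.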

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
[EMAIL PROTECTED]  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/