On Mon, Jun 18, 2007 at 11:07:42AM +0100, Nick Kew wrote:
> On Sun, 17 Jun 2007 11:42:08 -0400
> Daniel Veillard <[EMAIL PROTECTED]> wrote:
>
> > Coming back with some kind of definition of what a tag soup parser
> > behaviour is is probably more important than digging in libxml2 code.
>
> A slightly circular argument in this case. What I really need to
> do is review the case history of what users complain about, and
> relate that to how the parser works. Bear in mind this is a
> streaming SAX parser: other APIs are way too slow and therefore
> of no interest in this context.
As an aside: I wonder why you think the reader would be that
much slower. I only ran XML tests, but its cost was within 20% of the
SAX parsing speed.
> If I write a new parser from scratch, it'll be a simpleminded thing
> based on what bad tag-soup "html" expects:
>
> <foo ...> generates a startElement event
> </foo> generates an endElement event
> <!-- generates a start-comment which is terminated by -->
> <script> and <style> treat their contents as a black-box
> terminated by </script>/</style> and nothing else.
>
> With libxml2 we can add value to that by inserting implied closing
> tags. But in some cases, we need to avoid inserting implied opening
> tags. And we should dispense with some error corrections such as
> rejecting an <html> opening tag after a document has opened.
> In fact, I think we need to dispense with generating *any* implied
> opening tags when in tag-soup mode. Which in turn means we can't
> imply closing tags, lest they be unmatched!
>
> So in terms of a first-iteration draft wishlist, tag-soup mode should:
> - avoid inserting any implied tags in a SAX parse
That would be contrary to what Tag Soup actually means for most people,
as I pointed out.
> - treat contents of <script></script> and <style></style> as raw
> CDATA, and don't parse it.
You need *some* parsing just to detect the end tag, and now you're
back where you started: which of these will you accept as the terminator
</
</sc
</script
</script>
</SCRIPT
</ScRIpT
</SCRIPT >
?
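
To show that any answer is itself a parsing rule, here is a minimal
sketch in Python. The rule it uses is purely an assumption for
illustration (case-insensitive name, optional whitespace, then ">"); it
is not what libxml2 does, and the names make_end_tag_matcher, accepted
and split_raw_text are hypothetical.

```python
import re

# The candidate terminators listed above.
CANDIDATES = ["</", "</sc", "</script", "</script>",
              "</SCRIPT", "</ScRIpT", "</SCRIPT >"]


def make_end_tag_matcher(name):
    """Assumed rule: '</' + name (any case) + optional whitespace + '>'."""
    return re.compile(r"</" + re.escape(name) + r"\s*>", re.IGNORECASE)


def accepted(name, candidates):
    """Return the candidates that this rule treats as a complete end tag."""
    pattern = make_end_tag_matcher(name)
    return [c for c in candidates if pattern.fullmatch(c)]


def split_raw_text(text, name):
    """Split text at the first end tag; return (raw content, remainder).

    Returns (text, None) when no end tag is found, i.e. the element
    is still open at the end of the available input.
    """
    match = make_end_tag_matcher(name).search(text)
    if match is None:
        return text, None
    return text[:match.start()], text[match.end():]
```

Under this particular rule, accepted("script", CANDIDATES) keeps only
"</script>" and "</SCRIPT >". A different rule would keep a different
subset, which is exactly the decision that has to be written down.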
> > which it seems would defeat your first example I guess.
> > The problem really is to try to come back to a set of guarantees and
> > behavior rules. Reading the slides pointed from the end of that page
> > may help. But I'm not sure it's what you want, but since you use the
> > same name, it should hopefully be close.
>
> Sounds like he's using "tag soup" to mean something that cleans it up,
> in the tradition of Tidy or AccessValet. I'm contemplating the exact
> opposite: something that leaves it intact!
And I think that as an API you just can't! You will break applications if you deliver
<em> aaa <b> bbb </em> ccc </b>
as two opening tags followed by two closing tags in the wrong order.
It seems that what you want is a purely textual transformation, and in that case a parser
doesn't sound like the best tool to implement it. But maybe I misunderstand.
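
To make the breakage concrete, here is a small sketch in Python. It is
not libxml2 code: raw_events is a hypothetical, deliberately naive
tokenizer that reports tags exactly in source order with no repair, and
first_mismatch stands in for any application that keeps a stack of open
elements.

```python
import re

# Naive tag recognizer (an assumption for this sketch): a name made of
# ASCII letters and digits, attributes skipped up to the next '>'.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def raw_events(markup):
    """Return ('start', name) / ('end', name) events in source order."""
    return [("end" if m.group(1) else "start", m.group(2).lower())
            for m in TAG.finditer(markup)]


def first_mismatch(events):
    """Return the name of the first end event that does not close the
    innermost open element, or None if every end event matches.

    Elements left open at the end of input are not reported.
    """
    stack = []
    for kind, name in events:
        if kind == "start":
            stack.append(name)
        elif not stack or stack.pop() != name:
            return name
    return None
```

For the example above, raw_events gives start em, start b, end em,
end b, and first_mismatch reports "em": the consumer sees </em> while
<b> is still the innermost open element.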
Daniel
--
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard | virtualization library http://libvirt.org/
[EMAIL PROTECTED] | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/