On Mon, 18 Jun 2007 08:14:01 -0400 Daniel Veillard <[EMAIL PROTECTED]> wrote:
> Out of context. I wonder why you think the reader would be that
> much slower. I did only XML tests but the cost was within 20% of the
> SAX parsing speed.

Because it lacks a ParseChunk API, which means it can't work with
Apache's pipelined filter architecture. Unless you've added such
an API since I last looked.

> > So in terms of a first-iteration draft wishlist, tag-soup mode
> > should:
> >  - avoid inserting any implied tags in a SAX parse
>
> That would be contrary to what Tag Soup actually means for most
> people as I pointed out.

OK, consider the example referenced from my blog in my first post,
coming from a microsoft sharepoint backend, which inserted a bogus
<meta> at the top. Try running the following through
"xmllint --html":

<meta http-equiv="content-type" content="text/html;charset=ascii" />
<html lang="en">
<head><title>foo</title></head>
<body><h1>Hello, World</h1></body>
</html>

and it becomes:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><meta http-equiv="content-type" content="text/html;charset=ascii"></head>
<body>
<p> lang="en"&gt; </p>
<title>foo</title>
<h1>Hello, World</h1>
</body>
</html>

From the point of view of the user, that's worse than the original,
because real-life browsers will render that first bogus paragraph.
It's because of examples like that that I want to make it a
configurable option NOT to insert any inferred tags.

> >  - treat contents of <script></script> and <style></style> as raw
> >    CDATA, and don't parse it.
>
> You need *some* parsing just to detect the end of tag, and now
> you're back to the origin, what criteria will you keep
>
> </
> </sc
> </script
> </script>
> </SCRIPT
> </ScRIpT
> </SCRIPT >
>
> ?

Case-insensitive "</script" is the token to look for. Having found
it, we then look for ">" preceded by zero or more whitespace chars.
Yes, that'll still screw up on document.write('</script>').
Needs more thought.
But at least it will leave things like

<script>
document.write('<p>Something</p>');
</script>

intact.

> > Sounds like he's using "tag soup" to mean something that cleans it
> > up, in the tradition of Tidy or AccessValet. I'm contemplating the
> > exact opposite: something that leaves it intact!
>
> And I think as an API you just can't ! You will break apps if you
> deliver <em> aaa <b> bbb </em> ccc </b>
> as 2 opening tag and then 2 closing tag but inverted.

Cases like that don't seem to hit my inbox. I guess that's because
even frontpage-weenies don't produce code like that (or if they do,
they can see what's wrong for themselves).

> Seems what you want is textual transformation only, and in that case
> a parser doesn't sound like the best tool to implement this. But
> maybe I misunderstand.

Yes, you could be right. That's the other option. I already have
a simple sed-like filter (mod_line_edit), which offers a fallback to
users with hopelessly broken markup they can't do anything about.
But that loses the point and the power of a markup-aware parser
generating a stream of events.

-- 
Nick Kew

Application Development with Apache - the Apache Modules Book
http://www.apachetutor.org/
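
For illustration only, here is a minimal C sketch of the end-of-script
rule described above: find a case-insensitive "</script", skip zero or
more whitespace characters, and require ">". The function name
find_script_end and its signature are hypothetical; nothing here is
part of libxml2 or Apache. As already conceded, it will still stop
early on document.write('</script>').

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (an assumption for illustration, not a libxml2
 * or Apache API).
 *
 * Scans buf[0..len) for the end of a <script> element using the rule:
 * case-insensitive "</script", then zero or more whitespace
 * characters, then ">".
 *
 * Returns the offset of the '<' that starts the end tag, or -1 if no
 * complete end tag is present in the buffer. If end_len is non-NULL
 * and a match is found, it receives the length of the matched end tag.
 */
long find_script_end(const char *buf, size_t len, size_t *end_len)
{
    static const char token[] = "</script";
    const size_t tlen = sizeof(token) - 1;

    for (size_t i = 0; i + tlen <= len; i++) {
        /* Compare the candidate against the token, ignoring case. */
        size_t j = 0;
        while (j < tlen && tolower((unsigned char)buf[i + j]) == token[j])
            j++;
        if (j < tlen)
            continue;

        /* Skip optional whitespace between the tag name and '>'. */
        size_t k = i + tlen;
        while (k < len && isspace((unsigned char)buf[k]))
            k++;

        if (k < len && buf[k] == '>') {
            if (end_len != NULL)
                *end_len = k - i + 1;
            return (long)i;
        }
        /* Otherwise (e.g. "</scriptx>" or truncated input) keep scanning. */
    }
    return -1;
}
```

On the document.write('<p>Something</p>') example, the sketch ignores
the <p> and </p> inside the string and reports only the final end tag.
A caller fed data in chunks could treat -1 as "wait for more data",
though that is an assumption about how a filter would use it.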
