Re: [xml] Parsing tag-soup HTML

Nick Kew Mon, 18 Jun 2007 06:02:47 -0700

On Mon, 18 Jun 2007 08:14:01 -0400
Daniel Veillard <[EMAIL PROTECTED]> wrote:



>   Out of context. I wonder why you think the reader would be that
> much slower. I did only XML tests but the cost was within 20% of the
> SAX parsing speed.

Because it lacks a ParseChunk API, which means it can't work with
Apache's pipelined filter architecture.  Unless you've added
such an API since I last looked.
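
For readers unfamiliar with the term, a chunk-fed (push) parse can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard html.parser as a stand-in, not libxml2 or its reader, and the names TagEvents and parse_in_chunks are hypothetical. The point it shows is that a filter can hand over data in arbitrary pieces, even pieces that split a tag, and still receive the same events as a whole-document parse.

```python
from html.parser import HTMLParser


class TagEvents(HTMLParser):
    """Hypothetical helper: records start/end tag events as they fire."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def parse_in_chunks(chunks):
    """Feed the parser one chunk at a time, as a pipelined filter would."""
    parser = TagEvents()
    for chunk in chunks:
        # An incomplete tag at the end of a chunk is buffered until
        # the rest of it arrives in a later feed() call.
        parser.feed(chunk)
    parser.close()
    return parser.events


if __name__ == "__main__":
    pieces = ["<html><bo", "dy><h1>Hello", ", World</h1></bo", "dy></html>"]
    for event in parse_in_chunks(pieces):
        print(event)
```

Only tag events are recorded here, because a stand-in parser like this one may report text that straddles a chunk boundary as more than one data event.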

> > So in terms of a first-iteration draft wishlist, tag-soup mode
> > should:
> >   - avoid inserting any implied tags in a SAX parse
> 
>   That would be contrary to what Tag Soup actually means for most
> people as I pointed out.

OK, consider the example referenced from my blog in my first post,
coming from a Microsoft SharePoint backend, which inserted a bogus
<meta> at the top.

Try running the following through "xmllint --html":

<meta http-equiv="content-type" content="text/html;charset=ascii" />
<html lang="en">
<head><title>foo</title></head>
<body><h1>Hello, World</h1></body>
</html>

and it becomes:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><meta http-equiv="content-type"
content="text/html;charset=ascii"></head>
<body>
<p> lang="en"&gt;
</p>
<title>foo</title>
<h1>Hello, World</h1>
</body>
</html>

From the point of view of the user, that's worse than the original,
because real-life browsers will render that first bogus paragraph.
It's because of examples like that that I want to make it a
configurable option NOT to insert any inferred tags.
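
To make the requested behaviour concrete, here is a rough sketch of an event stream for the input above when no tags are inferred. It is an assumption-laden illustration: it uses Python's standard html.parser, which happens to report only the tags present in the input, and it says nothing about how such an option would be implemented in libxml2. The names RawEvents and raw_tag_events are hypothetical.

```python
from html.parser import HTMLParser

# The input from the message, with the bogus leading <meta>.
SOURCE = (
    '<meta http-equiv="content-type" content="text/html;charset=ascii" />\n'
    '<html lang="en">\n'
    "<head><title>foo</title></head>\n"
    "<body><h1>Hello, World</h1></body>\n"
    "</html>\n"
)


class RawEvents(HTMLParser):
    """Hypothetical helper: records tags exactly as they appear."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def raw_tag_events(text):
    """Return the tag events for text, with nothing inferred or moved."""
    parser = RawEvents()
    parser.feed(text)
    parser.close()
    return parser.events


if __name__ == "__main__":
    for event in raw_tag_events(SOURCE):
        print(event)
```

In this sketch the <html> element keeps its lang attribute and no <p> event appears, so a filter that rewrites selected events and passes the rest through would not produce the bogus paragraph.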
 

> >   - treat contents of <script></script> and <style></style> as raw
> >     CDATA, and don't parse it.
> 
>   You need *some* parsing just to detect the end of tag, and now
> you're back to the origin, what criteria will you keep
> 
>     </
>     </sc
>     </script
>     </script>
>     </SCRIPT
>     </ScRIpT
>     </SCRIPT >
>  
>  ?

Case-insensitive "</script" is the token to look for.
Having found it, we then look for ">" preceded by zero or
more whitespace chars.

Yes, that'll still screw up on document.write('</script>').
Needs more thought.  But at least it will leave things like
<script>
    document.write('<p>Something</p>');
</script>
intact.
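
The rule described above can be sketched in a few lines. This is a minimal illustration, not libxml2 code: the pattern and the function name split_script are hypothetical, and the known weakness with document.write('</script>') is left in place on purpose.

```python
import re

# Case-insensitive "</script", then zero or more whitespace chars, then ">".
SCRIPT_END = re.compile(r"</script\s*>", re.IGNORECASE)


def split_script(text):
    """Split text that follows an opening <script> tag.

    Returns (raw_body, rest_after_end_tag), or None when no complete
    end tag has been seen yet (for example, more input is still to come).
    """
    match = SCRIPT_END.search(text)
    if match is None:
        return None
    return text[:match.start()], text[match.end():]


if __name__ == "__main__":
    sample = "\n    document.write('<p>Something</p>');\n</script>\n<p>after</p>"
    body, rest = split_script(sample)
    print(repr(body))
    print(repr(rest))
```

Under this rule "</SCRIPT >" and "</ScRIpT>" both end the element, while "</", "</sc" and a bare "</script" do not until a ">" arrives. Markup inside the script body is passed through untouched, but a literal "</script>" inside a JavaScript string still ends the element early.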

> > Sounds like he's using "tag soup" to mean something that cleans it
> > up, in the tradition of Tidy or AccessValet.  I'm contemplating the
> > exact opposite: something that leaves it intact!
> 
>   And I think as an API you just can't ! You will break apps if you
> deliver <em> aaa <b> bbb </em> ccc </b>
>  as 2 opening tag and then 2 closing tag but inverted.

Cases like that don't seem to hit my inbox.  I guess that's because
even frontpage-weenies don't produce code like that (or if they do,
they can see what's wrong for themselves).

> Seems what you want is textual transformation only, and in that case
> a parser doesn't sound like the best tool to implement this. But
> maybe I misunderstand.

Yes, you could be right.  That's the other option.

I already have a simple sed-like filter (mod_line_edit), which
offers a fallback to users with hopelessly broken markup they
can't do anything about.  But that loses the point and the power
of a markup-aware parser generating a stream of events.

-- 
Nick Kew

Application Development with Apache - the Apache Modules Book
http://www.apachetutor.org/