Hi Andy,

Your XNI-based HTML parser seems very interesting. I think it would be worth
integrating it with Xerces, or at least releasing it under an
Apache-style license.

Regards,

        Mikko Honkala
        www.x-smiles.org

PS. (We are currently using tidy-j in our open source prototype XML browser X-Smiles,
but we would be quite interested in replacing it
with a more efficient solution such as yours.)


> -----Original Message-----
> From: Andy Clark [mailto:[EMAIL PROTECTED]]
> Sent: 15 February 2002 3:03
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: [ANNOUNCE] Xerces HTML Parser
>
>
> Joseph Kesselman/CAM/Lotus wrote:
> > One question: A huge percentage of the files out there which claim to be
> > HTML aren't, or at least aren't correct HTML. Browsers are generally very
> > forgiving and attempt to read past those errors.... but exactly how they
> recover varies from browser to browser, so consistency of response to those
> documents is a problem. Does this HTML prototype attempt that kind of
> recovery? Should it? And if it should, does it document what approach it's
> using so it can be compared with the various browsers and/or W3C's "tidy"
> tool?
>
> The intention of the NekoHTML parser was to write an example
> that uses the Xerces Native Interface to show how easily other
> types of parsers can be written using the framework. But being
> able to parse HTML files is quite useful beyond just being an
> example of XNI.
>
> One of my goals was to make the parser operate in a serial
> manner. Because NekoHTML doesn't buffer the document content,
> it cannot clean up the document as much as a tool like Tidy.
> However, it uses much less memory than Tidy or other equivalents
> while being able to fix up most of the common problems.
>
> Another benefit of writing the parser using the XNI framework
> is that the codebase can remain incredibly small. The parser can
> generate XNI events and work with all of the existing (and future)
> XNI tools. For example, I don't have to write any code to create
> DOM, JDOM, or DOM4J trees; emit SAX events; or serialize the
> document to a file. I just plug it in and it works.
>
> But back to your question...
>
> I don't claim to clean documents a certain way; the goal is
> just to produce a balanced, well-formed document. This work,
> though, is done by the tag balancer -- the scanner just
> tokenizes the input. By separating the tag balancing code
> into an XNI component in the document pipeline, I could
> certainly write different kinds of balancers that attempt
> to clean up the events in their own way. But I don't try
> to do it the Microsoft IE way or the Netscape Navigator
> way. However, I should document better *how* I do my
> particular brand of tag balancing.
>
> The parser will not be able to handle incredibly bad HTML
> documents. But I hope it hits the sweet spot of existing
> documents. I've run it on a number of major websites that
> have their own sets of problems (CNN, Slashdot, etc.) and
> it handles them pretty well.
>
> So I would like people to try it out and let me know
> whether it's worth integrating into Xerces-J.
>
> --
> Andy Clark * [EMAIL PROTECTED]
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>

