Hi, Assaf,

I believe you've suggested using Tidy/JTidy to preprocess the HTML before
parsing. Would it be possible to integrate JTidy, or to apply Tidy's rules, in
the HTML parser?

Assaf Arkin wrote:

>
> If a tag is not closed but its parent is closed, the tag will be
> forcefully closed and an error issued (but will not stop the parser). If
> the tag is optional closing (like P), no error will be issued. If the
> tag is explicitly closed (e.g. LI closes another LI, /UL and /OL close
> any open LI) it will be properly dealt with.
>
> HTML and BODY tags are always created whether they exist or not in the
> file.
>
> This is all taken care of and is most of what the HTML parser is
> supposed to do, as opposed to an XML parser which demands well formed
> documents.
>
> As for overlapping <b>, <i> and <form> (tricky), I use the DOM
> normalization and not any specific approach taken by any one parser.
> It's a bit easier for a parser to work with <b>/<i> since it need not
> create a DOM but just fontify text sections.
>
> arkin
>
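
To check my reading of the closing rules quoted above, here is a minimal
sketch in Python. It is not the parser's actual code: the names normalize,
OPTIONAL_CLOSE and CLOSED_BY_OPEN, the token format, and the error messages
are all hypothetical, and the tag sets are an assumed subset. It leaves out
the automatic creation of HTML and BODY and the handling of overlapping
<b>/<i>/<form>.

```python
# Hypothetical sketch of the quoted closing rules; not the real parser.
OPTIONAL_CLOSE = {"p", "li"}      # assumed subset: end tag may be omitted silently
CLOSED_BY_OPEN = {"li": {"li"}}   # opening the key closes an open tag in the value set


def normalize(tokens):
    """Return (events, errors) for a stream of ("open" | "close", name) tokens."""
    stack, events, errors = [], [], []

    def pop_one():
        name = stack.pop()
        events.append(("close", name))
        return name

    for kind, name in tokens:
        if kind == "open":
            # e.g. a new LI explicitly closes the LI that is still open
            if stack and stack[-1] in CLOSED_BY_OPEN.get(name, ()):
                pop_one()
            stack.append(name)
            events.append(("open", name))
        elif kind == "close":
            if name not in stack:
                # assumption: a stray end tag is reported and skipped
                errors.append(f"stray </{name}> ignored")
                continue
            # force-close children left open inside the element being closed
            while stack[-1] != name:
                child = pop_one()
                if child not in OPTIONAL_CLOSE:
                    errors.append(f"<{child}> closed by </{name}>")
            pop_one()

    # end of input: close whatever is still open, without stopping on errors
    while stack:
        child = pop_one()
        if child not in OPTIONAL_CLOSE:
            errors.append(f"<{child}> not closed at end of input")
    return events, errors
```

Feeding it <ul><li><li><b></ul> as tokens closes the first LI when the second
one opens, and closes B and the second LI when /UL arrives. Only the B is
reported as an error, since LI is treated as optional-closing here.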
