Henri Sivonen wrote:
The Validator.nu HTML Parser comes with a sample application called
HTML2XML. When the input is a conforming HTML5 document, the output is
the semantically equivalent XHTML5 document. HTML2XML doesn't repair
non-conforming documents.
You can obtain the Java version from http://about.validator.nu/htmlparser/
Sam Ruby is working on a version that avoids the JVM invocation
overhead:
http://intertwingly.net/blog/2009/06/15/Invoking-HtmlParser-from-C
If your pipeline is in Java, you don't need HTML2XML; just use the
Validator.nu HTML Parser directly, which avoids the steps of
serializing to XML and reparsing the result.
Update: I'm working on that too:
http://intertwingly.net/blog/2009/06/17/Calling-JAXP-from-Ruby
Jonathan: I will echo what Henri says. Except for edge cases, HTML5
parsers and serializers can simply be considered a drop-in replacement
for XML parsers and serializers. Every effort has been made to keep
those edge cases as few and as small as possible, and the cases where
differences are unavoidable are clearly documented. Apparently Henri's
favorite example is form feed characters. Mine is consecutive dashes in
comments.
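
To make those two edge cases concrete, here is a minimal sketch that uses only the JDK's built-in XML parser (not the Validator.nu HTML Parser). It checks whether a string is well-formed XML, which shows why content that an HTML parser can accept (a form feed in text, or "--" inside a comment) cannot be written out as XML unchanged. The class name XmlEdgeCases and the helper isWellFormedXml are made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class XmlEdgeCases {

    /** Hypothetical helper: true if the JDK XML parser accepts the string as well-formed. */
    static boolean isWellFormedXml(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Well-formedness violations arrive here as SAXParseException.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O trouble is not what this sketch is about.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Ordinary whitespace in text is fine in XML.
        System.out.println(isWellFormedXml("<p>a b</p>"));
        // U+000C (form feed) is not a legal XML 1.0 character.
        System.out.println(isWellFormedXml("<p>a\fb</p>"));
        // A single dash inside a comment is fine.
        System.out.println(isWellFormedXml("<p><!-- a - b --></p>"));
        // Two consecutive dashes inside a comment are not allowed in XML.
        System.out.println(isWellFormedXml("<p><!-- a -- b --></p>"));
    }
}
```

The first and third inputs are accepted and the second and fourth are rejected, so a tool that turns an HTML parse tree into XML has to alter or refuse exactly these spots.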
- Sam Ruby