Henri Sivonen wrote:

The Validator.nu HTML Parser comes with a sample application called HTML2XML. When the input is a conforming HTML5 document, the output is the semantically equivalent XHTML5 document. HTML2XML doesn't repair non-conforming documents.

You can obtain the Java version from http://about.validator.nu/htmlparser/

Sam Ruby is working on a version that avoids the JVM invocation overhead:
http://intertwingly.net/blog/2009/06/15/Invoking-HtmlParser-from-C

If your pipeline is in Java, you don't need HTML2XML; just use the Validator.nu HTML Parser directly, which avoids the extra steps of serializing to XML and reparsing the result.

Update: I'm working on that too: http://intertwingly.net/blog/2009/06/17/Calling-JAXP-from-Ruby

Jonathan: I will echo what Henri says. Except for edge cases, HTML5 parsers and serializers can simply be considered a drop-in replacement for XML parsers and serializers. Every effort has been made to keep the edge cases as few and as narrow as possible. And the cases where the differences are unavoidable are clearly documented. Apparently Henri's favorite example is form feed characters. Mine is consecutive dashes in comments.
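
To make those two edge cases concrete, here is a minimal sketch that uses only the JDK's built-in JAXP parser (not the Validator.nu parser) to show that XML itself rejects both of them. The class name XmlEdgeCases and the helper isWellFormedXml are made up for this illustration; how an HTML-to-XML tool copes with such input is not shown here.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class: checks whether a string is well-formed XML 1.0.
public class XmlEdgeCases {

    // Returns true if the JDK's XML parser accepts the text, false if it
    // reports a well-formedness error.
    static boolean isWellFormedXml(String text) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow errors instead of letting the default handler print them.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(text)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A comment with single dashes is fine in XML.
        System.out.println(isWellFormedXml("<p><!-- a - b --></p>"));
        // Consecutive dashes inside a comment are not allowed in XML.
        System.out.println(isWellFormedXml("<p><!-- a -- b --></p>"));
        // A form feed (U+000C) is not a legal XML 1.0 character.
        System.out.println(isWellFormedXml("<p>a\fb</p>"));
    }
}
```

Both inputs are things an HTML document can contain, which is why a tool that goes from HTML to XML has to treat them specially rather than pass them through unchanged.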

- Sam Ruby
