On 11/10/11 8:48 PM, Daniel Veillard wrote:
   Well the canonical way is HTML Tidy from Dave Raggett (though
he seems to have stepped down) http://tidy.sourceforge.net/

Tidy was a great tool, but the original code hasn't been updated in three years. Replacements have come along, but I haven't found anything in C that could be integrated into a daemon.
   One of the real development goals that could still make sense
in libxml2 is to make the HTML parser behave like an HTML5 one
(or allow this as an option). There is already shared code for HTML5
parsing, but it's C++ (IIRC) and I can't rely on it. If people start
to agree a bit formally on how to parse "web HTML", i.e. the ignominious
mixtures that most Web parsers are built to process, and handle all
corner cases in a consistent, documented way, then upgrading libxml2
to behave in the same way as much as possible would be *great*, but
that would definitely be a lot of work, and I can't commit to anything
like this :-)
   The interesting point in this approach is that it doesn't have to
be 6 months of continuous work to produce results. This could be achieved
progressively: add an HTML_PARSE_HTML5 flag to htmlParserOption, then
add fixes to the existing HTML parser as we meet corner cases and
decide to handle them.
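
To make that a bit more concrete, here is a minimal sketch of how such an
opt-in flag might sit next to the existing htmlParserOption values. Note
that HTML_PARSE_HTML5 is purely hypothetical (it only appears in a comment
below); htmlReadMemory, the other flags and htmlDocDumpMemory are existing
libxml2 API:

/* Sketch only: HTML_PARSE_HTML5 is hypothetical and not part of libxml2;
 * everything else below is the existing libxml2 HTML API. */
#include <stdio.h>
#include <string.h>
#include <libxml/HTMLparser.h>
#include <libxml/HTMLtree.h>
#include <libxml/xmlmemory.h>

int main(void) {
    const char *broken = "<p>unclosed <b>tags<table><tr><td>cell";

    /* Today: lenient recovery with the legacy HTML parser. */
    int options = HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING;
    /* options |= HTML_PARSE_HTML5;   hypothetical opt-in to HTML5 tree building */

    htmlDocPtr doc = htmlReadMemory(broken, (int)strlen(broken),
                                    "fragment.html", NULL, options);
    if (doc == NULL)
        return 1;

    xmlChar *mem = NULL;
    int size = 0;
    htmlDocDumpMemory(doc, &mem, &size);   /* serialize the repaired tree */
    fwrite(mem, 1, (size_t)size, stdout);

    xmlFree(mem);
    xmlFreeDoc(doc);
    xmlCleanupParser();
    return 0;
}
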
The HTML5 draft goes into the 'rules' for cleaning up malformed 'fragments', but it's too dense for me to think through a libxml2 integration:

http://www.w3.org/TR/2011/WD-html5-20110525/parsing.html#parsing

I could help write unit tests if someone wants to make an attempt. Once the parser is written, a slim command-line interface to it could be the future replacement for HTML Tidy.
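
A rough sketch of what such a front end could look like, using only the
current libxml2 API (nothing HTML5-specific yet): read possibly broken
markup on stdin, let the parser recover, write the repaired document to
stdout.

/* Sketch of a tidy-like filter built on the existing libxml2 HTML parser. */
#include <stdio.h>
#include <libxml/HTMLparser.h>
#include <libxml/HTMLtree.h>

int main(void) {
    int options = HTML_PARSE_RECOVER | HTML_PARSE_NOERROR |
                  HTML_PARSE_NOWARNING | HTML_PARSE_NONET;

    /* htmlReadFd() parses straight from a file descriptor (0 = stdin). */
    htmlDocPtr doc = htmlReadFd(0, "stdin.html", NULL, options);
    if (doc == NULL) {
        fprintf(stderr, "parse failed\n");
        return 1;
    }

    htmlDocDump(stdout, doc);   /* serialize the cleaned-up document */

    xmlFreeDoc(doc);
    xmlCleanupParser();
    return 0;
}

Compiling it should only need libxml2, e.g. cc tidyish.c `xml2-config
--cflags --libs` (tidyish.c is just a placeholder name).
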

Once an HTML fragment has been processed into a 'sane' state, sanitizing it using XPath/XSLT rules is feasible.
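
For example (just a sketch, and the "//script" rule is only illustrative of
the kind of sanitizing rule meant above): an XPath expression selects the
unwanted nodes, and libxml2 unlinks them before the document is re-serialized.

/* Strip every <script> element from an already-parsed document using
 * libxml2's XPath API. */
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

void strip_scripts(htmlDocPtr doc) {
    xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
    if (ctx == NULL)
        return;

    xmlXPathObjectPtr res = xmlXPathEvalExpression(BAD_CAST "//script", ctx);
    if (res != NULL && res->nodesetval != NULL) {
        int i;
        for (i = 0; i < res->nodesetval->nodeNr; i++) {
            xmlNodePtr node = res->nodesetval->nodeTab[i];
            xmlUnlinkNode(node);               /* detach from the tree */
            xmlFreeNode(node);                 /* release the subtree  */
            res->nodesetval->nodeTab[i] = NULL;
        }
    }
    xmlXPathFreeObject(res);
    xmlXPathFreeContext(ctx);
}
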


