Jumping back on that old thread, now that I have a bit of time for the xml mail folder :-)

On Mon, Oct 31, 2011 at 07:27:43PM -0500, Ladar Levison wrote:
> On Mon, 10/31/2011 5:48 PM, Stefan Sauer wrote:
> >On 09/18/2011 10:24 PM, Glen Hein wrote:
[...]
> My vote is to add a generic XML sanitizer. Presumably it would
> correct syntax problems, escape special characters, etc. Once the
> data is syntactically correct, the sanitizer could use a
> dtd/schema/xslt to add missing elements, or more importantly strip
> unwanted elements. The obvious application is HTML. A web server
> could pass untrusted bytes into the sanitizer and get back a result
> that is both valid and safe. Different levels/rules would be used to
> achieve different results.

  Well, the canonical way is HTML Tidy from Dave Raggett (though
he seems to have stepped down):
  http://tidy.sourceforge.net/

> Of course there are existing solutions, but everything I've found so
> far is written in PHP, Perl, Python, Java, et al. And most are
> written as standalone command line tools. Launching a command line
> tool, particularly an executable that runs atop a virtual machine is
> very inefficient, and difficult to scale. Having the functionality
> inside libxml2 means daemons that already use the library could
> easily sanitize their output, and with relatively little overhead
> protect themselves from a number of potential problems.
>
> A secondary goal would be the standardization of the dtd/schema/xslt
> rules that are used to sanitize HTML (and other XML formatted
> content). Right now, every sanitizer uses a different set of rules,
> and looks for a different collection of exploits. If a new trick is
> discovered to pass harmful data to clients, presumably by
> encapsulating it in a way that might be valid, but which gets parsed
> by some clients in a "vendor specific" way, updating the
> standardized rules would allow all the sanitizers to adapt without
> changing code...
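
  To make the rule-driven idea in the quoted proposal concrete, here is a
minimal sketch of an allowlist sanitizer. It is written in Python with only
the standard html.parser module, purely for brevity; it is not libxml2 code
and not how Tidy works. The names ALLOWED_TAGS, ALLOWED_ATTRS, VOID_TAGS,
DROP_CONTENT, AllowlistSanitizer and sanitize are all made up for this
illustration, and the rule set is an assumption, not a vetted security
policy.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical rule set, for illustration only (not a vetted policy).
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "a", "ul", "ol", "li", "br"}
ALLOWED_ATTRS = {"a": {"href"}}          # attributes kept, per tag
VOID_TAGS = {"br"}                       # tags that never get a close tag
DROP_CONTENT = {"script", "style"}       # drop the element and its text
SAFE_SCHEMES = ("http://", "https://", "mailto:")


class AllowlistSanitizer(HTMLParser):
    """Re-serialize only allowed tags/attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []          # output fragments
        self.open_tags = []    # stack of allowed tags currently open
        self.skip_depth = 0    # > 0 while inside a DROP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tag is stripped, its text content is kept
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, ()):
                continue
            value = value or ""
            if name == "href" and not value.lower().startswith(SAFE_SCHEMES):
                continue  # refuse javascript: and other schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in self.open_tags:
            return  # stray close tag, ignore it
        # Close anything left open inside this element, then the element.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(untrusted):
    """Return a well-nested fragment containing only allowed markup."""
    parser = AllowlistSanitizer()
    parser.feed(untrusted)
    parser.close()
    while parser.open_tags:  # close whatever the input left open
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)
```

  With these rules, sanitize('<p onclick="x()">Hi <b>there<script>alert(1)</script></p>')
drops the onclick attribute and the script element and closes the dangling
<b>. The point of the sketch is that the behaviour lives in the rule tables
at the top, so a shared, standardized rule set could be updated without
touching the code.
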

  One of the real development goals that could still make sense in
libxml2 is to make the HTML parser behave like an HTML5 one (or allow
this as an option). There is already shared code for HTML5 parsing, but
it's C++ (IIRC) and I can't rely on it.
  If people start to agree a bit formally on how to parse "web HTML",
i.e. the ignominious mixtures that most Web parsers are built to
process, and handle all corner cases in a consistent, documented way,
then upgrading libxml2 to behave the same way as much as possible
would be *great*. But that would definitely be a lot of work, and I
can't commit to anything like this :-)
  The interesting point in this approach is that it doesn't have to be
6 months of continuous work to produce results. It could be achieved
progressively, by adding an HTML_PARSE_HTML5 flag to htmlParserOption
and then adding fixes to the existing HTML parser as we meet the
problems and decide to address them.

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
dan...@veillard.com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/