Jumping back on that old thread, now that I have a bit of time for
the xml mail folder :-)

On Mon, Oct 31, 2011 at 07:27:43PM -0500, Ladar Levison wrote:
> On Mon, 10/31/2011 5:48 PM, Stefan Sauer wrote:
> >On 09/18/2011 10:24 PM, Glen Hein wrote:
[...]
> My vote is to add a generic XML sanitizer. Presumably it would
> correct syntax problems, escape special characters, etc. Once the
> data is syntactically correct, the sanitizer could use a
> dtd/schema/xslt to add missing elements, or more importantly strip
> unwanted elements. The obvious application is HTML. A web server
> could pass untrusted bytes into the sanitizer and get back a result
> that is both valid and safe. Different levels/rules would be used to
> achieve different results.

  Well, the canonical way is HTML Tidy from Dave Raggett (though
he seems to have stepped down): http://tidy.sourceforge.net/

> Of course there are existing solutions, but everything I've found so
> far is written in PHP, Perl, Python, Java, et al. And most are
> written as standalone command line tools. Launching a command line
> tool, particularly an executable that runs atop a virtual machine, is
> very inefficient and difficult to scale. Having the functionality
> inside libxml2 means daemons that already use the library could
> easily sanitize their output, and with relatively little overhead
> protect themselves from a number of potential problems.
> 
> A secondary goal would be the standardization of the dtd/schema/xslt
> rules that are used to sanitize HTML (and other XML formatted
> content). Right now, every sanitizer uses a different set of rules,
> and looks for a different collection of exploits. If a new trick is
> discovered to pass harmful data to clients, presumably by
> encapsulating it in a way that might be valid, but which gets parsed
> by some clients in a "vendor specific" way, updating the
> standardized rules would allow all the sanitizers to adapt without
> changing code...

  One of the real development goals that could still make sense
in libxml2 is to make the HTML parser behave like an HTML 5 one
(or allow this as an option). There is already shared code for HTML5
parsing, but it's C++ (IIRC) and I can't rely on it. If people start
to agree a bit formally on how to parse "web HTML", i.e. the ignominious
mixtures that most Web parsers are built to process, and handle all
corner cases in a consistent, documented way, then upgrading libxml2
to behave the same way as much as possible would be *great*. But
that would definitely be a lot of work, and I can't commit to anything
like this :-)
  The interesting point in this approach is that it doesn't have to
take 6 months of continuous work to produce results. It could be achieved
progressively, by adding an HTML_PARSE_HTML5 flag to htmlParserOption
and then adding fixes to the existing HTML parser as we encounter
problems and decide to address them.
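
  A minimal sketch of that incremental, opt-in approach, in plain C and
deliberately independent of libxml2's real internals. Everything named
sketch_* or SKETCH_* is hypothetical and only stands in for
htmlParserOption and the parser's input path. The one "fix" shown
(turning CR LF and lone CR into LF, as the HTML Standard's input
preprocessing describes) is an example picked for illustration, not
something proposed in this thread.

```c
#include <stddef.h>

/* Hypothetical stand-ins for htmlParserOption bits. The names and values
 * are illustrative only and are NOT taken from libxml2's headers. */
enum sketch_option {
    SKETCH_PARSE_RECOVER = 1 << 0,
    SKETCH_PARSE_HTML5   = 1 << 1   /* the proposed opt-in flag */
};

/* One isolated fix: rewrite CR LF pairs and lone CR as LF.
 * Works in place and returns the new length. The caller must make sure
 * buf has room for a terminating NUL at buf[len]. */
static size_t sketch_normalize_newlines(char *buf, size_t len)
{
    size_t in, out = 0;

    for (in = 0; in < len; in++) {
        if (buf[in] == '\r') {
            buf[out++] = '\n';
            if (in + 1 < len && buf[in + 1] == '\n')
                in++;               /* swallow the LF of a CR LF pair */
        } else {
            buf[out++] = buf[in];
        }
    }
    buf[out] = '\0';
    return out;
}

/* Each fix is applied only when the caller opted in, so callers that do
 * not pass the flag see exactly the old behaviour. */
static size_t sketch_prepare_input(char *buf, size_t len, int options)
{
    if (options & SKETCH_PARSE_HTML5)
        len = sketch_normalize_newlines(buf, len);
    /* further fixes would be added here, one at a time */
    return len;
}
```

  The point of the design is that every new behaviour sits behind the
same flag check, so fixes can land one by one without changing results
for existing users of the parser.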

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
dan...@veillard.com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/
