On Mon, 10/31/2011 5:48 PM, Stefan Sauer wrote:
On 09/18/2011 10:24 PM, Glen Hein wrote:
Hello,
I'm a software developer and I'd like to contribute to Gnome's XML project. I've used the libxml software for a long time and I'd like to
give something back.
I just started a voluntary career break, but I'd like to stay active.
I looked over the TODO file, but I'm not sure which item to tackle. Could you
recommend an item for someone new to the project?
Thanks,
Glen Hein
One thing that would be super cool would be multi-threaded xslt processing (e.g. for chunked document output). Unfortunately again, this
is not trivial at all. But any speedup for xslt processing would be great. The docbook xml -> html step in gtk-doc is so slow that most
developers still turn api-doc generation off :/
Stefan
My vote is to add a generic XML sanitizer. Presumably it would correct syntax problems, escape special characters, etc. Once the data is
syntactically correct, the sanitizer could use a dtd/schema/xslt to add missing elements, or more importantly strip unwanted elements. The
obvious application is HTML. A web server could pass untrusted bytes into the sanitizer and get back a result that is both valid and safe.
Different levels/rules would be used to achieve different results.
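A minimal sketch of the "strip unwanted elements" step, written in Python with only the standard library purely for illustration (the proposal itself is for native code inside libxml2, and nothing here is a libxml2 API). The allowlists, the `Sanitizer` class and the `sanitize` function are all hypothetical names; a real rule set would come from a dtd/schema rather than hard-coded sets.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlists, invented for this sketch.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "a", "br"}
ALLOWED_ATTRS = {"a": {"href"}}
VOID_TAGS = {"br"}                  # elements that never get a closing tag
DROP_CONTENT = {"script", "style"}  # drop these tags and everything inside


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []   # stack of allowed tags that are still open
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tag: drop the tag itself, keep its text
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, set()):
                continue
            value = value or ""
            # Only plain http(s) links survive in this sketch.
            if name == "href" and not value.lower().startswith(("http://", "https://")):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in self.open_tags:
            return
        # Close anything left open inside this element, then the element.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))

    def result(self):
        self.close()
        while self.open_tags:  # repair unclosed elements
            self.out.append(f"</{self.open_tags.pop()}>")
        return "".join(self.out)


def sanitize(untrusted):
    parser = Sanitizer()
    parser.feed(untrusted)
    return parser.result()
```

Calling `sanitize('<p>Hi <b>there<script>alert(1)</script></p>')` drops the script and closes the dangling `<b>`, so both the syntax repair and the stripping happen in one pass. An allowlist is used here because anything the rules do not mention is removed by default.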
Of course there are existing solutions, but everything I've found so far is written in PHP, Perl, Python, Java, et al. And most are written
as standalone command line tools. Launching a command line tool, particularly an executable that runs atop a virtual machine, is very
inefficient and difficult to scale. Having the functionality inside libxml2 means daemons that already use the library could easily
sanitize their output, and with relatively little overhead protect themselves from a number of potential problems.
A secondary goal would be the standardization of the dtd/schema/xslt rules that are used to sanitize HTML (and other XML formatted content).
Right now, every sanitizer uses a different set of rules, and looks for a different collection of exploits. If a new trick is discovered to
pass harmful data to clients, presumably by encapsulating it in a way that might be valid, but which gets parsed by some clients in a
"vendor specific" way, updating the standardized rules would allow all the sanitizers to adapt without changing code...
Just my .02.
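To illustrate the point about shared rules adapting without code changes, here is a rough sketch (Python again, for brevity only). The JSON rule format, the `RULES_V1`/`RULES_V2` contents and the `url_allowed` helper are invented for this example and do not correspond to any existing standard or libxml2 feature.

```python
import json

# Two hypothetical versions of a shared rule file. Version 2 reacts to a
# newly discovered trick by adding entries; the checking code is unchanged.
RULES_V1 = json.loads('{"version": 1, "blocked_url_schemes": ["javascript"]}')
RULES_V2 = json.loads(
    '{"version": 2, "blocked_url_schemes": ["javascript", "vbscript", "data"]}'
)


def url_allowed(url, rules):
    # Remove whitespace and control characters, which some clients ignore
    # when parsing a scheme, then compare case-insensitively.
    cleaned = "".join(ch for ch in url if ch > " ").lower()
    scheme, sep, _ = cleaned.partition(":")
    if not sep:
        return True  # no scheme at all, e.g. a relative URL
    return scheme not in rules["blocked_url_schemes"]
```

With the first rule set `url_allowed("data:text/html,<script>", RULES_V1)` passes, and with the second it is rejected, even though `url_allowed` itself did not change. A production rule set would more likely be an allowlist; the blocklist here just mirrors the "collection of exploits" wording above.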
_______________________________________________
xml mailing list, project page http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml