Re: [xml] i'm here to contribute
On 11/11/2011 03:57 AM, Daniel Veillard wrote:
> On Mon, Oct 31, 2011 at 11:48:54PM +0100, Stefan Sauer wrote:
>> On 09/18/2011 10:24 PM, Glen Hein wrote:
>>> Hello,
>>>
>>> I'm a software developer and I'd like to contribute to Gnome's XML project. I've used the libxml software for a long time and I'd like to give something back. I just started a voluntary career break, but I'd like to stay active. I looked over the TODO file, but I'm not sure which item to tackle. Could you recommend an item for someone new to the project?
>>>
>>> Thanks,
>>> Glen Hein
>>
>> One thing that would be super cool would be multi-threaded xslt processing (e.g. for chunked document output). Unfortunately again, this is not trivial at all. But any speedup for xslt processing would be great. The DocBook XML -> HTML step in gtk-doc is so slow that most developers still turn api-doc generation off :/
>
> Processing chunks in a subthread is an interesting idea. The stylesheet is read-only from a transformation process POV, so that may work without too much craziness...
>
> Two suggestions:
> - what about a reduced, simplified DocBook XSLT for GNOME, using only what you care about, that could be packaged and registered in the XML Catalog, and potentially simplify the processing? Running xsltproc -v on a number of documents and grepping the results may lead to interesting results (but that will be voluminous!).

I spent a few evenings trying to write an xsltpp (preprocessor), but I got stuck at finding an API to save a stylesheet back to disk. The idea was to load a stylesheet, resolve all the xincludes, then run optimisation passes (substituting parameters, removing unused templates and branches, ...) and save the result as a preprocessed stylesheet. I did some of this manually and it gives some impressive speedups.

> - check where the time is really spent: is that in the XPath engine? I used to run kcachegrind on DocBook transformations to find the hotspots. I think I had that flattened at the time (6-7 years ago), but with the new stylesheets there may be new trouble spots, as was pointed out recently.

That's why I wrote the profiler (which I have committed in the meantime). On the xslt side, oprofile shows some of the effects described in the XPath performance issues thread, and then a lot of cases where each function is fast but is simply called far too often.

Stefan

> Daniel

___
xml mailing list, project page http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml
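One of the optimisation passes Stefan mentions for a stylesheet preprocessor, removing unused templates, can be sketched in a few lines. This is only an illustration under stated assumptions, not Stefan's xsltpp: it uses Python's standard `xml.etree.ElementTree` rather than libxslt, the names `prune_unused_named_templates` and `SAMPLE` are made up for the example, and it assumes all includes have already been merged into one document and that template names carry no namespace prefix.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"
# Keep the conventional "xsl" prefix when the tree is serialized again.
ET.register_namespace("xsl", XSL_NS)


def prune_unused_named_templates(stylesheet_text):
    """Return the stylesheet with never-called named templates removed.

    Only templates that have a name and no match attribute are candidates,
    because match templates can fire without being called explicitly.
    """
    root = ET.fromstring(stylesheet_text)
    # Collect every name referenced by an xsl:call-template instruction.
    called = {
        call.get("name")
        for call in root.iter("{%s}call-template" % XSL_NS)
    }
    # Top-level templates are direct children of xsl:stylesheet.
    for template in list(root.findall("{%s}template" % XSL_NS)):
        name = template.get("name")
        if name and template.get("match") is None and name not in called:
            root.remove(template)
    return ET.tostring(root, encoding="unicode")


# Hypothetical three-template stylesheet used only for this illustration.
SAMPLE = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:call-template name="used"/>
  </xsl:template>
  <xsl:template name="used"><p>kept</p></xsl:template>
  <xsl:template name="unused"><p>dropped</p></xsl:template>
</xsl:stylesheet>"""

pruned = prune_unused_named_templates(SAMPLE)
print(pruned)
```

The pass is deliberately conservative and runs once: a template that is called only from a template that was itself removed survives, so a real preprocessor would probably repeat the pass until nothing changes.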
Re: [xml] i'm here to contribute
On Tue, Nov 01, 2011 at 11:33:44AM +0100, Thomas Schraitle wrote:
> Hi Stefan,
>
> On Monday, 31 October 2011, 23:48:54, Stefan Sauer wrote:
>> [...]
>> One thing that would be super cool would be multi-threaded xslt processing (e.g. for chunked document output). Unfortunately again, this is not trivial at all. But any speedup for xslt processing would be great. The DocBook XML -> HTML step in gtk-doc is so slow that most developers still turn api-doc generation off :/
>
> I learned a few days ago that Saxon9 already has thread support. With the new DocBook stylesheets written in XSLT 2.0, it is pretty fast. (Note: these are work in progress.)
>
> Yes, it would be super-cool to have that in xsltproc as well, especially as the trend goes from XSLT 1.0 to XSLT 2.0. However, I know this is not trivial at all.

Well, I doubt libxslt will be upgraded to 2.0. As I pointed out before, I really don't have the time for such a massive development.

Daniel

--
Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
dan...@veillard.com | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library http://libvirt.org/
Re: [xml] i'm here to contribute
On 11/10/11 8:48 PM, Daniel Veillard wrote:
> Well, the canonical way is HTML Tidy from Dave Raggett (though he seems to have stepped down) http://tidy.sourceforge.net/

Tidy was a great tool, but the original code hasn't been updated in three years. Replacements have come along, but I haven't found anything in C that could be integrated into a daemon.

> One of the real development goals that could still make sense in libxml2 is to make the HTML parser behave like an HTML 5 one (or allow this as an option). There is already shared code for HTML5 parsing, but it's C++ (IIRC) and I can't rely on it. If people start to agree a bit formally on how to parse web HTML, i.e. the ignominious mixtures that most Web parsers are built to process, and handle all corner cases in a consistent, documented way, then upgrading libxml2 to behave in the same way as much as possible would be *great*. But that would definitely be a lot of work, and I can't commit to anything like this :-)
>
> The interesting point in this approach is that it doesn't have to be 6 months of continuous work to produce results. It could be achieved progressively, by adding an HTML_PARSE_HTML5 flag to htmlParserOption and adding fixes to the existing HTML parser as we meet them and decide to fix them.

The HTML5 draft goes into the 'rules' for cleaning up malformed 'fragments', but it's too dense for me to think about a libxml2 integration:
http://www.w3.org/TR/2011/WD-html5-20110525/parsing.html#parsing

I could help write unit tests if someone wants to make an attempt?

Once the parser is written, a slim command line interface to it could be the future replacement for HTML Tidy. Once an HTML fragment has been processed into a 'sane' state, sanitizing it using xpath/xslt rules is feasible.
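The last point, that sanitizing becomes straightforward once a fragment is in a 'sane' (well-formed) state, can be illustrated with a small sketch. It is a stand-in under stated assumptions rather than anything proposed for libxml2: it uses Python's standard `xml.etree.ElementTree` with plain tree iteration instead of XPath or XSLT rules, it assumes the fragment has already been repaired into well-formed XML by some earlier step, and the policy in `FORBIDDEN_ELEMENTS` as well as the helper names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical policy for illustration: element names to drop outright.
FORBIDDEN_ELEMENTS = {"script", "iframe", "object"}


def remove_keeping_tail(parent, elem):
    """Remove elem from parent without losing the text that follows it."""
    index = list(parent).index(elem)
    if elem.tail:
        if index == 0:
            parent.text = (parent.text or "") + elem.tail
        else:
            previous = parent[index - 1]
            previous.tail = (previous.tail or "") + elem.tail
    parent.remove(elem)


def sanitize(fragment):
    """Strip forbidden elements and on* event attributes from a
    well-formed fragment and return the serialized result."""
    root = ET.fromstring(fragment)
    # ElementTree nodes have no parent pointer, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    for elem in list(root.iter()):
        if elem.tag in FORBIDDEN_ELEMENTS:
            parent = parents.get(elem)
            # The root has no parent and is left in place in this sketch.
            if parent is not None:
                remove_keeping_tail(parent, elem)
            continue
        for attr in list(elem.attrib):
            if attr.lower().startswith("on"):
                del elem.attrib[attr]
    return ET.tostring(root, encoding="unicode")


clean = sanitize(
    '<div onclick="steal()"><p>Hello</p><script>alert(1)</script></div>'
)
inline = sanitize("<p>a<script>x</script>b</p>")
print(clean)
print(inline)
```

A real sanitizer would more likely work from an allow-list of elements and attributes than from a deny-list like this one; the sketch only shows that the rules are simple once the parser has done the hard part.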
Re: [xml] i'm here to contribute
On 09/18/2011 10:24 PM, Glen Hein wrote:
> Hello,
>
> I'm a software developer and I'd like to contribute to Gnome's XML project. I've used the libxml software for a long time and I'd like to give something back. I just started a voluntary career break, but I'd like to stay active. I looked over the TODO file, but I'm not sure which item to tackle. Could you recommend an item for someone new to the project?
>
> Thanks,
> Glen Hein

One thing that would be super cool would be multi-threaded xslt processing (e.g. for chunked document output). Unfortunately again, this is not trivial at all. But any speedup for xslt processing would be great. The DocBook XML -> HTML step in gtk-doc is so slow that most developers still turn api-doc generation off :/

Stefan