Re: [xml] i'm here to contribute

2011-11-23 Thread Stefan Sauer
On 11/11/2011 03:57 AM, Daniel Veillard wrote:
 On Mon, Oct 31, 2011 at 11:48:54PM +0100, Stefan Sauer wrote:
 On 09/18/2011 10:24 PM, Glen Hein wrote:
 Hello,

 I'm a software developer and I'd like to contribute to Gnome's XML
 project. I've used the libxml software for a long time and I'd like to
 give something back.

 I just started a voluntary career break, but I'd like to stay active.

 I looked over the TODO file, but I'm not sure which item to tackle.
 Could you recommend an item for someone new to the project?

 Thanks,
 Glen Hein

 One thing that would be super cool would be multi-threaded xslt
 processing (e.g. for chunked document output). Unfortunately again, this
 is not trivial at all. But any speedup for xslt processing would be
 great. The docbook xml -> html step in gtk-doc is so slow that most
 developers still turn api-doc generation off :/
Processing chunks in a subthread is an interesting idea. The
 stylesheet is read-only from a transformation process POV so that
 may work without too much craziness...
Two suggestions:
 - what about a reduced simplified DocBook XSLT for gnome, using
   only what you care about, that could be packaged and registered
   in the XML Catalog, and potentially simplify the processing.
   Running xsltproc -v on a number of documents and grepping the
   output may lead to interesting results (but that will be
   voluminous!).
I spent a few evenings trying to write an xsltpp (preprocessor), where
I even got stuck at finding an API to save a stylesheet back to disk. The
idea here was to load a stylesheet, do all the xincludes and then do
optimisation passes (like substituting parameters, removing unused
templates, branches, ...) and save that as a preprocessed stylesheet. I
did some of this manually and it gives some impressive speedups.
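
A minimal sketch of one such optimisation pass (removing unused templates), written against Python's standard xml.etree.ElementTree rather than libxslt. The function name prune_unused_named_templates is hypothetical, and the sketch assumes a single self-contained stylesheet: it ignores xsl:import/xsl:include, prefixed template names, and self-recursive templates, so treat it as an illustration of the idea only.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"


def prune_unused_named_templates(stylesheet_xml: str) -> str:
    """Drop top-level named templates that no xsl:call-template refers to.

    Hypothetical helper for illustration; not part of libxslt.
    """
    ET.register_namespace("xsl", XSL_NS)
    root = ET.fromstring(stylesheet_xml)
    # Repeat until stable: removing one template can orphan another.
    while True:
        called = {
            call.get("name")
            for call in root.iter(f"{{{XSL_NS}}}call-template")
        }
        dead = [
            tmpl
            for tmpl in root.findall(f"{{{XSL_NS}}}template")
            # Templates with a match pattern can fire without being called.
            if tmpl.get("name")
            and not tmpl.get("match")
            and tmpl.get("name") not in called
        ]
        if not dead:
            break
        for tmpl in dead:
            root.remove(tmpl)
    return ET.tostring(root, encoding="unicode")
```

The returned string could then be written to disk as the preprocessed stylesheet; whether that alone yields a measurable speedup would depend on the stylesheet.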

 - check where the time is really spent, is that in the XPath engine ?
   I used to run kcachegrind on DocBook transformations to find
   where the hotspots were. I think I had that flattened at the time
   (6-7 years ago), but with new stylesheets it's possible there
   are new troubles, as was pointed out recently.
That's why I wrote the profiler (that I committed in the meantime). On
the xslt side using oprofile shows some effects of what is described in
the XPath performance issues thread and then a lot of cases where each
function is fast, but simply called way too often.

Stefan
 Daniel


___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] i'm here to contribute

2011-11-10 Thread Daniel Veillard
On Tue, Nov 01, 2011 at 11:33:44AM +0100, Thomas Schraitle wrote:
 Hi Stefan,
 
 Am Montag, 31. Oktober 2011, 23:48:54 schrieb Stefan Sauer:
  [...]
  
  One thing that would be super cool would be multi-threaded xslt
  processing (e.g. for chunked document output). Unfortunately again, this
  is not trivial at all. But any speedup for xslt processing would be
  great. The docbook xml -> html step in gtk-doc is so slow that most
  developers still turn api-doc generation off :/
 
 I learned a few days ago that Saxon9 already has thread support. With the 
 new DocBook stylesheets written in XSLT2.0, it is pretty fast. (Note: these 
 are work in progress.)
 
 Yes, it would be super-cool to have that in xsltproc as well. Especially as 
 the trend goes from XSLT1.0 to XSLT2.0. However, I know, this is not trivial 
 at all.

  Well, I doubt libxslt will be upgraded to 2.0; as I pointed out
before, I really don't have the time for such a massive development.

Daniel

-- 
Daniel Veillard  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
dan...@veillard.com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/


Re: [xml] i'm here to contribute

2011-11-10 Thread Ladar Levison

On 11/10/11 8:48 PM, Daniel Veillard wrote:

   Well the canonical way is HTML tidy from Dave Raggett (though
he seems to have stepped down) http://tidy.sourceforge.net/


Tidy was a great tool, but the original code hasn't been updated in three 
years. Replacements have come along, but I haven't found anything in C 
that could be integrated into a daemon.

   One of the real development goals that could still make sense
in libxml2 is to make the HTML parser behave like an HTML 5 one
(or allow this as an option), there is already shared code for HTML5
parsing but it's C++ (IIRC) and I can't rely on it. If people start
to agree a bit formally on how to parse web HTML, i.e. the ignominious
mixtures that most Web parsers are built to process, and handle all
corner cases in a consistent documented way, then upgrading libxml2
to behave in the same way as much as possible would be *great*, but
that would definitely be a lot of work, and I can't commit to anything
like this :-)
   The interesting point in this approach is that it doesn't have to
be 6 months of continuous work to produce results, this could be achieved
progressively, adding an HTML_PARSE_HTML5 flag to htmlParserOption
and adding fixes as we meet them and decide to fix them to the
existing HTML parser.
The HTML5 draft goes into the 'rules' for cleaning up malformed 
'fragments'. But it's too dense for me to think about a libxml2 integration:


http://www.w3.org/TR/2011/WD-html5-20110525/parsing.html#parsing

I could help write unit tests if someone wants to make an attempt. Once 
the parser is written, a slim command line interface to it could be the 
future replacement for HTML tidy.


Once an HTML fragment has been processed into a 'sane' state, sanitizing 
using xpath/xslt rules is feasible.
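
A rough sketch of what rule-based sanitizing could look like once the fragment is well-formed. It uses Python's standard xml.etree.ElementTree and a plain tree walk as a stand-in for libxml2 XPath queries; the FORBIDDEN_TAGS set and the sanitize function are hypothetical, and the rules shown are far from a complete security policy.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set for illustration; a real policy would be much stricter.
FORBIDDEN_TAGS = {"script", "iframe", "object"}


def _remove_keep_tail(parent, child):
    """Remove child from parent but keep the text that follows it."""
    tail = child.tail or ""
    index = list(parent).index(child)
    if index > 0:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    else:
        parent.text = (parent.text or "") + tail
    parent.remove(child)


def sanitize(fragment: str) -> str:
    """Strip forbidden elements and script-bearing attributes.

    Assumes the input is already well-formed XML with a single root.
    """
    root = ET.fromstring(fragment)
    # Rule 1: drop forbidden elements together with their content.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in FORBIDDEN_TAGS:
                _remove_keep_tail(parent, child)
    # Rule 2: drop event handlers (on*) and javascript: URLs.
    for element in root.iter():
        for name in list(element.attrib):
            value = element.attrib[name].strip().lower()
            if name.lower().startswith("on") or value.startswith("javascript:"):
                del element.attrib[name]
    return ET.tostring(root, encoding="unicode")
```

The same rules could presumably be expressed as XPath expressions or an XSLT identity transform with a few overriding templates; the tree walk is used here only to keep the sketch dependency-free.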





Re: [xml] i'm here to contribute

2011-10-31 Thread Stefan Sauer
On 09/18/2011 10:24 PM, Glen Hein wrote:
 Hello,

 I'm a software developer and I'd like to contribute to Gnome's XML
 project. I've used the libxml software for a long time and I'd like to
 give something back.

 I just started a voluntary career break, but I'd like to stay active.

 I looked over the TODO file, but I'm not sure which item to tackle.
 Could you recommend an item for someone new to the project?

 Thanks,
 Glen Hein


One thing that would be super cool would be multi-threaded xslt
processing (e.g. for chunked document output). Unfortunately again, this
is not trivial at all. But any speedup for xslt processing would be
great. The docbook xml -> html step in gtk-doc is so slow that most
developers still turn api-doc generation off :/

Stefan