Re: [xml] performance of parsing docbook with xincludes

2018-06-10 Thread Stefan Sauer
On 06/07/2018 01:55 PM, Nick Wellnhofer wrote:
> On 07/06/2018 00:00, Stefan Sauer wrote:
>>> Another idea is to stop loading external DTDs for XIncludes without an
>>> XPointer expression. This would still change the behavior for some
>>> users but it's much less likely to cause problems.
>> change the behaviour, as in we would not catch validation errors?
>
> No, nothing related to validation. If you validate a document, the
> DTDs will always be loaded. But parsing with or without
> XML_PARSE_DTDLOAD will obviously produce different results. It's hard
> to tell whether this will cause problems for users. But maybe I'm
> overly cautious. If someone parses a document without DTD flags, why
> would they assume that XIncluded documents are parsed with
> XML_PARSE_DTDLOAD?
Validation is one thing, but applying default attributes, for example, is
another. Basically, what I want to avoid is loading the external subset
over and over again, while the internal subset should still be applied. I am
still looking for where things like
<!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
are applied. The other problem seems to be that ID refs between the
master and the xincluded docs are not resolved - is that what
XML_DETECT_IDS controls? I checked the doc comments in the sources, but it
is hard to tell. If I don't comment out
  pctxt->loadsubset |= XML_DETECT_IDS;
I get my links resolved, but the speedup is gone.

>
>> Too bad that xmlXIncludeParseFile() does not get the parent parserCtx;
>> in that case we could apply the same flags.
>
> I think the original flags are already passed via xmlXIncludeSetFlags.
You are right, traced it back.

>
>> It seems that xmlDict only handles string keys and string values,
>> right? So we'll need our own cache data structure. I'd say it
>> would need to be on the _xmlXIncludeCtxt level. A global cache is easier,
>> but then we can't ever free it :/
>
> xmlHash should work fine:
>
>     http://xmlsoft.org/html/libxml-hash.html
>
> But building a DTD cache would be the least of your problems. The hard
> part is to apply a cached DTD to a document. There are some
> interactions between internal and external subsets (see
> xmlAddElementDecl and xmlAddAttributeDecl in valid.c for example), so
> it looks like you can't simply set doc->extSubset to the
> cached DTD. You'd probably have to replay the calls to
> xmlAddElementDecl etc, maybe even in the original order which might be
> lost. That's why I wouldn't want to go down this route.

From looking more at the code, I agree. I am now checking whether I can share
the xmlDict between all the DTDs, so that we fix the 25% spent in
xmlFree. I don't want to replace the allocators, since I am using libxml2 from
Python via lxml and won't be able to patch the allocators there.
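
The idea of sharing one dictionary across DTDs can be pictured with a small
string-interning table: every distinct name is allocated once, and later
lookups return the same pointer, so freeing many DTDs no longer means freeing
the same names many times. This is only a sketch in plain C. InternDict,
intern and intern_free are hypothetical names, and the linear search is a
simplification that does not reflect how xmlDict is implemented.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define INTERN_MAX 64   /* hypothetical capacity, enough for a demo */

typedef struct {
    char  *strings[INTERN_MAX];
    size_t count;
} InternDict;

/* Return the single shared copy of s, allocating it on first use only. */
const char *intern(InternDict *dict, const char *s)
{
    size_t i, len;
    char *copy;

    assert(dict != NULL && s != NULL);
    for (i = 0; i < dict->count; i++)
        if (strcmp(dict->strings[i], s) == 0)
            return dict->strings[i];          /* already interned */
    if (dict->count == INTERN_MAX)
        return NULL;                          /* table full */
    len = strlen(s) + 1;
    copy = malloc(len);
    if (copy == NULL)
        return NULL;
    memcpy(copy, s, len);
    dict->strings[dict->count++] = copy;
    return copy;
}

/* One pass frees every string exactly once, however often it was used. */
void intern_free(InternDict *dict)
{
    size_t i;

    for (i = 0; i < dict->count; i++)
        free(dict->strings[i]);
    dict->count = 0;
}
```

With a shared table, the number of free calls depends on the number of
distinct names, not on how many documents referenced them.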

Thanks for your support on discussing the options.

>
> Nick



___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml


Re: [xml] performance of parsing docbook with xincludes

2018-06-08 Thread Nick Wellnhofer

On 08/06/2018 03:45, Eric S. Eberhard wrote:
> Some very simple things to do: 1) put the DTD hosts into the /etc/hosts file
> (or another if you like and substitute an IP) 2) set /etc/resolv.conf to
> first look in the hosts file (before DNS)


The discussion is not about caching DTDs loaded over the network but from the 
local file system. In this particular case, the same Docbook DTD (~250 KB) is 
parsed more than 100 times for each XInclude.


> If I was to suggest a speed up of libxml2 I would change it to allow
> optionally (probably at compile time) to never free memory -- each node, piece
> of data, etc that is created and destroyed constantly would just sit there
> (and slowly grow until it levels out).


libxml2 already allows you to use your own memory allocators. It's easy to 
make `free` a no-op.
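
A minimal sketch of what such a set of allocators could look like: a bump
allocator that hands out memory from one static arena and whose free does
nothing. The arena size, the alignment constant and all arena_ names are
assumptions made for this illustration. The four functions are shaped so that
they could be registered through libxml2's xmlMemSetup() hook, but nothing
here calls libxml2.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ARENA_SIZE  (1u << 20)  /* hypothetical 1 MiB budget */
#define ARENA_ALIGN 16u         /* payload alignment assumed sufficient here */

/* The union keeps the byte array aligned for the largest common types. */
static union {
    unsigned char bytes[ARENA_SIZE];
    long double   align_ld;
    long long     align_ll;
    void         *align_ptr;
} arena;
static size_t arena_used = 0;

/* Hand out the next chunk; a size header precedes each payload so that
 * arena_realloc knows how many bytes to copy. Memory is never reused. */
void *arena_malloc(size_t size)
{
    size_t payload, total;
    unsigned char *block;

    assert(sizeof(size_t) <= ARENA_ALIGN);
    if (size > ARENA_SIZE)
        return NULL;
    payload = ((size + ARENA_ALIGN - 1) / ARENA_ALIGN) * ARENA_ALIGN;
    total = ARENA_ALIGN + payload;
    if (total > ARENA_SIZE - arena_used)
        return NULL;                      /* arena exhausted */
    block = arena.bytes + arena_used;
    memcpy(block, &size, sizeof size);
    arena_used += total;
    return block + ARENA_ALIGN;
}

/* The whole point: freeing is a no-op. */
void arena_free(void *ptr)
{
    (void)ptr;
}

void *arena_realloc(void *ptr, size_t size)
{
    size_t old_size;
    void *fresh;

    if (ptr == NULL)
        return arena_malloc(size);
    memcpy(&old_size, (unsigned char *)ptr - ARENA_ALIGN, sizeof old_size);
    fresh = arena_malloc(size);
    if (fresh != NULL)
        memcpy(fresh, ptr, old_size < size ? old_size : size);
    return fresh;
}

char *arena_strdup(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = arena_malloc(len);

    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}
```

The trade-off is the one described in the quoted suggestion: memory use only
grows, so this suits short-lived processes rather than long-running ones.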


Nick


Re: [xml] performance of parsing docbook with xincludes

2018-06-07 Thread Nick Wellnhofer

On 07/06/2018 00:00, Stefan Sauer wrote:

>> Another idea is to stop loading external DTDs for XIncludes without an
>> XPointer expression. This would still change the behavior for some
>> users but it's much less likely to cause problems.
>
> change the behaviour, as in we would not catch validation errors?


No, nothing related to validation. If you validate a document, the DTDs will 
always be loaded. But parsing with or without XML_PARSE_DTDLOAD will obviously 
produce different results. It's hard to tell whether this will cause problems 
for users. But maybe I'm overly cautious. If someone parses a document without 
DTD flags, why would they assume that XIncluded documents are parsed with 
XML_PARSE_DTDLOAD?



> Too bad that xmlXIncludeParseFile() does not get the parent parserCtx;
> in that case we could apply the same flags.


I think the original flags are already passed via xmlXIncludeSetFlags.


> It seems that xmlDict only handles string keys and string values,
> right? So we'll need our own cache data structure. I'd say it
> would need to be on the _xmlXIncludeCtxt level. A global cache is easier,
> but then we can't ever free it :/


xmlHash should work fine:

http://xmlsoft.org/html/libxml-hash.html

But building a DTD cache would be the least of your problems. The hard part is 
to apply a cached DTD to a document. There are some interactions between 
internal and external subsets (see xmlAddElementDecl and xmlAddAttributeDecl 
in valid.c for example), so it looks like you can't simply set 
doc->extSubset to the cached DTD. You'd probably have to replay the calls to 
xmlAddElementDecl etc, maybe even in the original order which might be lost. 
That's why I wouldn't want to go down this route.
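
For illustration only, the lookup half of such a cache (the part described
above as the least of the problems) could look like the sketch below: a
string-keyed table that runs an expensive loader at most once per key. It
does not touch the hard part, which is applying a cached DTD to a document.
DtdCache, dtd_cache_get, dtd_cache_clear and demo_load_dtd are hypothetical
names, the DTD is kept as an opaque pointer, and a real version would use
xmlHash instead of the small linear array.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_SLOTS 16   /* hypothetical; xmlHash would replace this array */

typedef struct {
    char *system_id;     /* cache key, e.g. the DTD's system identifier */
    void *dtd;           /* opaque parsed DTD */
} DtdCacheEntry;

typedef struct {
    DtdCacheEntry entries[CACHE_SLOTS];
    size_t count;
    size_t loads;        /* how often the expensive loader really ran */
} DtdCache;

typedef void *(*DtdLoader)(const char *system_id);

/* Return the cached DTD for system_id, loading it on the first request. */
void *dtd_cache_get(DtdCache *cache, const char *system_id, DtdLoader load)
{
    size_t i, len;
    char *key;
    void *dtd;

    assert(cache != NULL && system_id != NULL && load != NULL);
    for (i = 0; i < cache->count; i++)
        if (strcmp(cache->entries[i].system_id, system_id) == 0)
            return cache->entries[i].dtd;            /* cache hit */

    dtd = load(system_id);                           /* cache miss */
    cache->loads++;
    if (dtd == NULL || cache->count == CACHE_SLOTS)
        return dtd;                                  /* nothing to remember */
    len = strlen(system_id) + 1;
    key = malloc(len);
    if (key != NULL) {
        memcpy(key, system_id, len);
        cache->entries[cache->count].system_id = key;
        cache->entries[cache->count].dtd = dtd;
        cache->count++;
    }
    return dtd;
}

/* Release keys and, if a destructor is given, the cached DTDs. */
void dtd_cache_clear(DtdCache *cache, void (*free_dtd)(void *))
{
    size_t i;

    for (i = 0; i < cache->count; i++) {
        free(cache->entries[i].system_id);
        if (free_dtd != NULL)
            free_dtd(cache->entries[i].dtd);
    }
    cache->count = 0;
}

/* Stand-in loader for the demo; a real one would parse the DTD file. */
void *demo_load_dtd(const char *system_id)
{
    static int parsed_dtd_token;

    (void)system_id;
    return &parsed_dtd_token;
}
```

Tying the cache lifetime to one XInclude run (create before processing, clear
afterwards) would avoid the never-freed global discussed earlier.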


Nick


Re: [xml] performance of parsing docbook with xincludes

2018-06-07 Thread Stefan Sauer
On 05/17/2018 04:18 PM, Nick Wellnhofer wrote:
> On 16/05/2018 21:51, Stefan Sauer wrote:
>> So one solution could be another flag to enable this?
>
> Yes, but it would be rather ugly.
>
>> Thanks, reading the code. Need to figure where we could cache external
>> subsets and what a suitable keys is (ExternalID ?).
>
> Note that I'm currently not planning to review and integrate larger
> patches from other developers. I only took over some libxml2
> maintenance duties because no one else did. So even if you write a
> high-quality patch, it might never get merged.
>
> Caching external subsets for XIncludes certainly sounds like a nice
> feature but I would prefer to find a simpler solution. For example,
> can't you just omit the external DTD from included documents?
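I've tried this and I get some interesting differences. If I modify my
DOCTYPE declarations from e.g.:
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
  "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd"
[
  <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">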
  
  %gtkdocentities;
]>

to

<!DOCTYPE book [
  <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
  
  %gtkdocentities;
]>

and run (for each of the variants)
xmllint --noent --xinclude tester-docs.xml >tester-docs.nodtd.xml
then I get a lot of delta in this form:
-    http://www.w3.org/2003/XInclude;
id="api-index-0.1" xml:base="xml/api-index-0.1.xml">
+    

Basically, if there is no DTD in the doctype, the resulting xi:include
nodes won't have the xmlns:xi attribute.

What is worse, and what puzzles me, is that it causes a small difference in
the HTML output produced by xsltproc:
-FOO, macro in GtkDocTestIf
+FOO, macro in GtkDocTestIf

If I drop the DTD, the first link is missing its 'class' and 'title' attributes
and the second link is not linked at all.

Stefan

> You wrote:
>
>> and gtk-doc will replicate this for the fragments (replacing 'book' with
>> e.g. 'refentry'). This way one can e.g. inject things like a version.
>
> What do you mean by "inject things like a version"? Why exactly do
> your included documents have to reference an external DTD?
>
> Another idea is to stop loading external DTDs for XIncludes without an
> XPointer expression. This would still change the behavior for some
> users but it's much less likely to cause problems.
>
> Nick





Re: [xml] performance of parsing docbook with xincludes

2018-06-07 Thread Stefan Sauer
On 06/07/2018 12:54 AM, Eric S. Eberhard wrote:
> I know I am the oddball here but -- why use DTDs at all?

I gave reasons above. I am working on a tool. How people use the tool
is not under my control. Maybe we can focus on the opportunity to
improve libxml2 a bit here.

>   I supply software to a lot of companies (thousands through
> dealers).  Many exchange millions of XML docs per day.  I've used this
> since it was libxml.  Even have some patches in there.  My application
> is proprietary (meaning XML to get an order or tell a customer our
> availability is simply XML I designed and documented and give to my
> customer's customers (via download from a Web page)).  Once they get
> it working it pretty much always works.  They write software to create
> orders and send them to us -- it is consistent (I know, not everyone
> has this luxury so this may not apply to everyone).  So why check them?
>
> I also found that I was getting a gagillion support tickets because
> DTDs ... simple things like a date ... seem to escape people -- take
> June 7, 2018
>
> In our date fields we will take:
>     Jun 7 2018
>     June 7 2018
>     the above with commas and any case (upper/lower/mixed)
>     6/7/18
>     6/7/2018
>     2018/6/7
>     20180607
>     180606
>  06-07-18
>
> And actually many many more.  Anything that is a date goes through
> this one routine and if there is any way in the world to extract a
> date, we do.
>
> Ditto money -- say $1,245.56
>
> We accept:
>     $1,245.56
>   1245.56
>   124556        (decimal is implied at 2 places if no decimal is
> found)
>        1,235.56
>
> And many more - same thing, one routine reads it and if we can
> possibly get a reasonable number, we do.
>
> This, in turn, reduced our CONSTANT support tickets for silly things
> like a format of something to ZERO.  Which I like.
>
> Even sicker -- we ignore case on tags.  All of our XML is designed to
> not use duplicate names with different cases (stupid thing to do
> anyway -- expect orderNumber and OrderNumber to both be used, as
> different things).
>
> As long as the customer is consistent and the XML is well formed we
> scan the tree and compare tags without regard to case.  A WHOLE LOT
> more support tickets gone.
>
> A lot of the people we deal with are not sophisticated.  As the
> receiver of XML we decided it was much better to be as flexible as
> possible and take what we can if at all possible.  After all -- a DTD
> can indeed tell you if an address comes in without a city name.  And
> reject it and usually generate a support ticket.  Since we use an
> on-line AVS system (more XML) and if we have the zip and the address
> otherwise matches ... we don't need the city and state ... the AVS
> system provides it.  And if it fails they will get an error back from
> us (from the application) anyway.  So why use a DTD to see if the city
> or state were sent?  A LOT MORE support calls removed.
>
> And, of course, performance without the DTDs is much better.
>
> As a result we are able to give documentation to new customers and
> they are able to get it up and running with little to no help.  Any
> serious errors we cannot fix are clearly explained in the responses BY
> THE APPLICATION and not by a DTD.
>
> Being flexible on our end reduces support tickets which is all I
> care.  I would rather code for all the mistakes I can think of an
> enduser would make (and we add new ones when they crop up) than be
> strict and do a lot of support.  We don't think DTDs are flexible
> enough.  And I hate making them :-)
>
> We do offer a page with DTDs they can use manually to check their
> document if they like -- or they can send it to our test system.  Once
> they are running they seem to do just fine.
>
> As programmers it is hard to believe but sometimes it is better for us
> to make slightly less efficient code in order to make the human aspect
> much more efficient.  I once had someone send me a link to a "contest"
> which was a convoluted C statement and asking to solve what the result
> would be.  My response -- "fire the programmer!"
>
> If it takes 100s of competent C programmers to get the right answer
> (and only a small percent did) to read a line of code -- it is bad
> code.  And for people's information, modern computers read ahead and
> pre-execute code based on all kinds of weird logic.  Simple C code is
> easy for it to handle ... but convoluted code ends up stopping the
> pre-execution and is actually slower -- may have less lines of code --
> but it will be slower.  I see nothing wrong with short clear clean
> code with as little craziness as possible.  This is the same with XML
> -- one can go overboard easily, K.I.S.S.  :-)
>
> Not being so strict and no DTDs has had other benefits -- say EDI
> (from old IBMs) -- we have a cheap program that maps EDI to XML and
> back.  So we can handle EDI -- and we don't need new software (after
> the conversion).  We accept the EDI, convert to XML, run our 

Re: [xml] performance of parsing docbook with xincludes

2018-06-06 Thread Stefan Sauer
On 05/17/2018 06:01 PM, Stefan Sauer wrote:
> On 05/17/2018 04:18 PM, Nick Wellnhofer wrote:
>> On 16/05/2018 21:51, Stefan Sauer wrote:
>>> So one solution could be another flag to enable this?
>> Yes, but it would be rather ugly.
> In which sense? I guess because it is something that no one should need
> to know about or have to care about?
>>> Thanks, reading the code. Need to figure where we could cache external
>>> subsets and what a suitable keys is (ExternalID ?).
>> Note that I'm currently not planning to review and integrate larger
>> patches from other developers. I only took over some libxml2
>> maintenance duties because no one else did. So even if you write a
>> high-quality patch, it might never get merged.
> Thanks for making this clear upfront. This is how I ended up becoming
> the gtkdoc maintainer :)
>
>> Caching external subsets for XIncludes certainly sounds like a nice
>> feature but I would prefer to find a simpler solution. For example,
>> can't you just omit the external DTD from included documents?
> Yeah, right now the benefit of having the DTD is that one can validate
> fragments. I'll do some research (aka grepping over existing projects)
> to see what the doctype headers in use today look like. If all that
> people do is use an entity to inject the version, I'll write a
> migration tool.
>
> We have a test that validates the doc, but I think I can change this to
> just resolve all xincludes and check through the top-level doctype.

Just to add to this, I am assuming a lot of people follow this book
http://www.sagehill.net/docbookxsl/ModularDoc.html#UsingXinclude

and using a DOCTYPE is part of the examples.
>> You wrote:
>>
>>> and gtk-doc will replicate this for the fragments (replacing 'book' with
>>> e.g. 'refentry'). This way one can e.g. inject things like a version.
>> What do you mean by "inject things like a version"? Why exactly do
>> your included documents have to reference an external DTD?
> The documentation consists of a handwritten master doc (type book) that
> includes more handwritten parts (e.g. tutorials, guides) and also includes
> generated reference docs. When gtkdoc generates the reference docs, it
> takes the doctype header of the master doc as a template and
> uses that for the generated reference docs. If the master doc has
> entities declared, those can be expanded in the reference fragments.
> That's the part where I will check how widely it is actually used.
>
> Stefan
>
>> Another idea is to stop loading external DTDs for XIncludes without an
>> XPointer expression. This would still change the behavior for some
>> users but it's much less likely to cause problems.
Change the behaviour, as in we would not catch validation errors?
Too bad that xmlXIncludeParseFile() does not get the parent parserCtx;
in that case we could apply the same flags.
>>
>> Nick
> I definitely don't know enough about the implications here. I was mostly
> thinking to see if we can stick a dictionary with xmlDtdPtr values into the
> parser context and, before actually loading a DTD,
> check whether we already did and reuse it. Somehow the dict needs to be stored
> in the top-level doc when parsing is done (do we need the DTDs once the
> doc has been parsed?). We only free the DTDs with the top-level doc. But
> I agree, it is not going to be a two-liner.

It seems that xmlDict only handles string keys and string values,
right? So we'll need our own cache data structure. I'd say it
would need to be on the _xmlXIncludeCtxt level. A global cache is easier,
but then we can't ever free it :/

Stefan
>
> Stefan
>
>





Re: [xml] performance of parsing docbook with xincludes

2018-05-17 Thread Stefan Sauer
On 05/17/2018 04:18 PM, Nick Wellnhofer wrote:
> On 16/05/2018 21:51, Stefan Sauer wrote:
>> So one solution could be another flag to enable this?
>
> Yes, but it would be rather ugly.
In which sense? I guess because it is something that no one should need
to know about or have to care about?
>
>> Thanks, reading the code. Need to figure where we could cache external
>> subsets and what a suitable keys is (ExternalID ?).
>
> Note that I'm currently not planning to review and integrate larger
> patches from other developers. I only took over some libxml2
> maintenance duties because no one else did. So even if you write a
> high-quality patch, it might never get merged.
Thanks for making this clear upfront. This is how I ended up becoming
the gtkdoc maintainer :)

>
> Caching external subsets for XIncludes certainly sounds like a nice
> feature but I would prefer to find a simpler solution. For example,
> can't you just omit the external DTD from included documents?
Yeah, right now the benefit of having the DTD is that one can validate
fragments. I'll do some research (aka grepping over existing projects)
to see what the doctype headers in use today look like. If all that
people do is use an entity to inject the version, I'll write a
migration tool.

We have a test that validates the doc, but I think I can change this to
just resolve all xincludes and check through the top-level doctype.


> You wrote:
>
>> and gtk-doc will replicate this for the fragments (replacing 'book' with
>> e.g. 'refentry'). This way one can e.g. inject things like a version.
>
> What do you mean by "inject things like a version"? Why exactly do
> your included documents have to reference an external DTD?

The documentation consists of a handwritten master doc (type book) that
includes more handwritten parts (e.g. tutorials, guides) and also includes
generated reference docs. When gtkdoc generates the reference docs, it
takes the doctype header of the master doc as a template and
uses that for the generated reference docs. If the master doc has
entities declared, those can be expanded in the reference fragments.
That's the part where I will check how widely it is actually used.

Stefan

>
> Another idea is to stop loading external DTDs for XIncludes without an
> XPointer expression. This would still change the behavior for some
> users but it's much less likely to cause problems.
>
> Nick

I definitely don't know enough about the implications here. I was mostly
thinking to see if we can stick a dictionary with xmlDtdPtr values into the
parser context and, before actually loading a DTD,
check whether we already did and reuse it. Somehow the dict needs to be stored
in the top-level doc when parsing is done (do we need the DTDs once the
doc has been parsed?). We only free the DTDs with the top-level doc. But
I agree, it is not going to be a two-liner.

Stefan




Re: [xml] performance of parsing docbook with xincludes

2018-05-17 Thread Nick Wellnhofer

On 16/05/2018 21:51, Stefan Sauer wrote:

> So one solution could be another flag to enable this?


Yes, but it would be rather ugly.


> Thanks, reading the code. Need to figure out where we could cache external
> subsets and what a suitable key is (ExternalID?).


Note that I'm currently not planning to review and integrate larger patches 
from other developers. I only took over some libxml2 maintenance duties 
because no one else did. So even if you write a high-quality patch, it might 
never get merged.


Caching external subsets for XIncludes certainly sounds like a nice feature 
but I would prefer to find a simpler solution. For example, can't you just 
omit the external DTD from included documents? You wrote:



> and gtk-doc will replicate this for the fragments (replacing 'book' with
> e.g. 'refentry'). This way one can e.g. inject things like a version.


What do you mean by "inject things like a version"? Why exactly do your 
included documents have to reference an external DTD?


Another idea is to stop loading external DTDs for XIncludes without an 
XPointer expression. This would still change the behavior for some users but 
it's much less likely to cause problems.


Nick


Re: [xml] performance of parsing docbook with xincludes

2018-05-17 Thread Stefan Sauer
On 05/17/2018 01:40 AM, Eric S. Eberhard wrote:
> So again, the first time, put it on your machine and use an IP (or
> localhost if you can which goes straight to the TCP/IP stack -- never
> goes to the network).  Also, you can do one other tacky thing I do ...
> sometimes when I see things like that and I don't need it, I just
> change libxml2 -- and put comments so I can update.  I have 2-3 of
> those.  libxml2 tries to satisfy the entire world ... you only need to
> make you happy :-)  You might also consider a static link if doing
> that -- safer if customer loads a different version, and it keeps
> loading itself from changing anything, and the program loads faster, etc.

Sure, that's where we are now. I am looking for this change to help
other developers, so I'll need to find a solution that can be merged
into libxml2. But thanks for the input.

Stefan

>
> Eric
>
> On 5/15/2018 3:42 AM, Nick Wellnhofer wrote:
>> On 14/05/2018 21:48, Stefan Sauer wrote:
>>> This part looks suspicious:
>>>
>>>     |--22.98%--0xc2160
>>>     |  xmlFreeDoc
>>>     |  |
>>>     |   --22.42%--xmlFreeDtd
>>
>>> Can I tell it to not load dtds in the first place? Is it loading the
>>> dtd for each and every xinclude?
>>
>> Good catch. It seems that the XInclude engine always parses included
>> docs with XML_PARSE_DTDLOAD:
>>
>>     https://git.gnome.org/browse/libxml2/tree/xinclude.c#n450
>>
>> If you're not using XML catalogs, this will probably cause the DTD to
>> be loaded over the network multiple times which could explain the
>> slowdown.
>>
>> Can you try to change the line to
>>
>>     xmlCtxtUseOptions(pctxt, ctxt->parseFlags);
>>
>> and see if it helps?
>>
>> Nick
>>
>
> -- 
> Eric S. Eberhard
> VICS
> 2933 W Middle Verde Road
> Camp Verde, AZ  86322
>
> 928-567-3727  work  928-301-7537  cell
>
> http://www.vicsmba.com/index.html (our work)
> http://www.vicsmba.com/ourpics/index.html (fun pictures)




Re: [xml] performance of parsing docbook with xincludes

2018-05-16 Thread Stefan Sauer
On 05/16/2018 12:41 AM, Nick Wellnhofer wrote:
> On May 15, 2018, at 21:56 , Stefan Sauer  wrote:
>> On 05/15/2018 08:40 PM, Stefan Sauer wrote:
>>> On 05/15/2018 12:42 PM, Nick Wellnhofer wrote:
>>>> Can you try to change the line to
>>>>
>>>> xmlCtxtUseOptions(pctxt, ctxt->parseFlags);
>>>>
>>>> and see if it helps?

>>> It does not help. I'll experiment further. Thanks for the recommendations.
> I think you also have to remove the line at 
> https://git.gnome.org/browse/libxml2/tree/xinclude.c#n463
>
> pctxt->loadsubset |= XML_DETECT_IDS;
>
> Looks like the idea is to make sure that ID attributes are detected for 
> XIncludes with XPointers. IMO, it should be the application's responsibility 
> to set the XML_PARSE_DTDLOAD flag in this case. But changing the behavior 
> might break code that relies on this feature.
This helps!

LD_LIBRARY_PATH=~/debug/lib ~/debug/bin/xmllint --timing --xinclude
--nonet --noent --noout glib-docs.xml
Parsing took 0 ms
Xinclude processing took 179 ms
Freeing took 17 ms

So one solution could be another flag to enable this?
>> Is libxml2 doing that for each file over and over?
> Yes.
Actually easy to confirm using --load-trace:
https://gist.github.com/ensonic/e1c4c7f80a0c072d119a649722de1e20
>> Wouldn't it make sense to only load each dtd once?
> This would make sense.
>
>> And where exactly is it loaded? (I can only
>> see xmlFreeDtd, but can't find an xmlLoadDtd or the like.)
> Via xmlParseDocument -> xmlSAX2ExternalSubset -> xmlParseExternalSubset.
Thanks, reading the code. I need to figure out where we could cache external
subsets and what a suitable key would be (ExternalID?).

Stefan

>
> Nick
>




Re: [xml] performance of parsing docbook with xincludes

2018-05-15 Thread Nick Wellnhofer
On May 15, 2018, at 21:56 , Stefan Sauer  wrote:
> 
> On 05/15/2018 08:40 PM, Stefan Sauer wrote:
>> On 05/15/2018 12:42 PM, Nick Wellnhofer wrote:
>>> Can you try to change the line to
>>> 
>>> xmlCtxtUseOptions(pctxt, ctxt->parseFlags);
>>> 
>>> and see if it helps?
>>> 
>> It does not help. I'll experiment further. Thanks for the recommendations.

I think you also have to remove the line at 
https://git.gnome.org/browse/libxml2/tree/xinclude.c#n463

pctxt->loadsubset |= XML_DETECT_IDS;

Looks like the idea is to make sure that ID attributes are detected for 
XIncludes with XPointers. IMO, it should be the application's responsibility to 
set the XML_PARSE_DTDLOAD flag in this case. But changing the behavior might 
break code that relies on this feature.
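
Why replacing the options line alone did not help can be pictured with a toy
model. It assumes, as the experiments in this thread suggest, that the parser
loads the external subset whenever the context's loadsubset field is
non-zero, so that OR-ing in an ID-detection bit has the same effect as
requesting DTD loading. The DEMO_ names and the struct below are hypothetical
stand-ins for this sketch, not libxml2's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the constants named in the thread. */
enum {
    DEMO_PARSE_DTDLOAD = 1 << 0,   /* parser option requested by the caller */
    DEMO_DETECT_IDS    = 1 << 1    /* bit stored in the loadsubset field */
};

typedef struct {
    int loadsubset;                /* non-zero: external subset gets loaded */
} DemoParserCtxt;

/* Toy version of applying options to a fresh parser context. */
static void demo_use_options(DemoParserCtxt *ctxt, int options)
{
    assert(ctxt != NULL);
    ctxt->loadsubset = (options & DEMO_PARSE_DTDLOAD) ? DEMO_DETECT_IDS : 0;
}

/* Returns 1 if an included document would load its external DTD.
 * force_dtdload models the first line discussed (always adding the
 * DTD-load option); force_detect_ids models the second line
 * (loadsubset |= ID detection). */
int demo_child_loads_dtd(int parent_flags, int force_dtdload,
                         int force_detect_ids)
{
    DemoParserCtxt child = { 0 };
    int options = parent_flags;

    if (force_dtdload)
        options |= DEMO_PARSE_DTDLOAD;
    demo_use_options(&child, options);
    if (force_detect_ids)
        child.loadsubset |= DEMO_DETECT_IDS;
    return child.loadsubset != 0;
}
```

In this model, a parent that sets no DTD flags still gets DTD loading for
includes until both forcing steps are removed, while a parent that asks for
DTD loading keeps it either way.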

> Is libxml2 doing that for each file over and over?

Yes.

> Wouldn't it make sense to only load each dtd once?

This would make sense.

> And where exactly is it loaded? (I can only
> see xmlFreeDtd, but can't find an xmlLoadDtd or the like.)

Via xmlParseDocument -> xmlSAX2ExternalSubset -> xmlParseExternalSubset.

Nick



Re: [xml] performance of parsing docbook with xincludes

2018-05-15 Thread Stefan Sauer
On 05/15/2018 08:40 PM, Stefan Sauer wrote:
> On 05/15/2018 12:42 PM, Nick Wellnhofer wrote:
>> On 14/05/2018 21:48, Stefan Sauer wrote:
>>> This part looks suspicious:
>>>
>>>     |--22.98%--0xc2160
>>>     |  xmlFreeDoc
>>>     |  |
>>>     |   --22.42%--xmlFreeDtd
>>> Can I tell it to not load dtds in the first place? Is it loading the
>>> dtd for each and every xinclude?
>> Good catch. It seems that the XInclude engine always parses included
>> docs with XML_PARSE_DTDLOAD:
>>
>>     https://git.gnome.org/browse/libxml2/tree/xinclude.c#n450
>>
>> If you're not using XML catalogs, this will probably cause the DTD to
>> be loaded over the network multiple times which could explain the
>> slowdown.
>>
>> Can you try to change the line to
>>
>>     xmlCtxtUseOptions(pctxt, ctxt->parseFlags);
>>
>> and see if it helps?
>>
>> Nick
> It does not help. I'll experiment further. Thanks for the recommendations.
And FYI, a call graph plot:
https://imgur.com/a/d27xxor

As an experiment, I dropped the doctype headers from the (generated)
xincluded files. So now it is 20 files with doctype headers + 105
(generated) files without doctype headers. And voila!

xmllint --timing --xinclude  --noout glib-docs.xml
Parsing took 0 ms
Xinclude processing took 447 ms
Freeing took 19 ms
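
As a rough back-of-the-envelope reading of these timings: the first run
reported in this thread took 4560 ms of XInclude processing, and this run
takes 447 ms after removing the doctype from 105 generated files. Assuming
the whole difference is due to loading the DTD once per file (an assumption,
not a measurement), the numbers can be turned into a speed-up factor and a
per-include cost. The helper names are made up for this illustration.

```c
#include <assert.h>

/* Speed-up factor between two wall-clock timings. */
double speedup(double before_ms, double after_ms)
{
    assert(after_ms > 0.0);
    return before_ms / after_ms;
}

/* Average extra cost per included file, assuming the whole difference
 * between the two runs is caused by per-file DTD loading. */
double per_include_overhead_ms(double before_ms, double after_ms, int files)
{
    assert(files > 0);
    return (before_ms - after_ms) / files;
}
```

Here speedup(4560, 447) is about 10.2, and
per_include_overhead_ms(4560, 447, 105) is about 39.2 ms per generated file.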

The docbook header looks like this:


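<!DOCTYPE book PUBLIC '-//OASIS//DTD DocBook XML V4.5//EN'
  'http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd' [
<!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">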

]>

and gtk-doc will replicate this for the fragments (replacing 'book' with
e.g. 'refentry'). This way one can e.g. inject things like a version.

I do have /usr/share/xml/docbook/schema/dtd/4.5/docbookx.dtd locally
available. I guess there is no way to avoid loading the DTD then. Is
libxml2 doing that for each file over and over? Wouldn't it make sense
to load each DTD only once? And where exactly is it loaded? (I can only
see xmlFreeDtd, but can't find an xmlLoadDtd or the like.)

Sorry for all the questions, but it looks like there is low hanging
fruit to save a lot of cpu time.

Stefan
>
>
> Stefan
>





Re: [xml] performance of parsing docbook with xincludes

2018-05-15 Thread Stefan Sauer
On 05/15/2018 12:42 PM, Nick Wellnhofer wrote:
> On 14/05/2018 21:48, Stefan Sauer wrote:
>> This part looks suspicious:
>>
>>     |--22.98%--0xc2160
>>     |  xmlFreeDoc
>>     |  |
>>     |   --22.42%--xmlFreeDtd
>
>> Can I tell it to not load dtds in the first place? Is it loading the
>> dtd for each and every xinclude?
>
> Good catch. It seems that the XInclude engine always parses included
> docs with XML_PARSE_DTDLOAD:
>
>     https://git.gnome.org/browse/libxml2/tree/xinclude.c#n450
>
> If you're not using XML catalogs, this will probably cause the DTD to
> be loaded over the network multiple times which could explain the
> slowdown.
>
> Can you try to change the line to
>
>     xmlCtxtUseOptions(pctxt, ctxt->parseFlags);
>
> and see if it helps?
>
> Nick

It does not help. I'll experiment further. Thanks for the recommendations.


Stefan



Re: [xml] performance of parsing docbook with xincludes

2018-05-15 Thread Nick Wellnhofer

On 14/05/2018 21:48, Stefan Sauer wrote:

> This part looks suspicious:
>
>     |--22.98%--0xc2160
>     |  xmlFreeDoc
>     |  |
>     |   --22.42%--xmlFreeDtd



> Can I tell it to not load dtds in the first place? Is it loading the dtd for
> each and every xinclude?


Good catch. It seems that the XInclude engine always parses included docs with 
XML_PARSE_DTDLOAD:


https://git.gnome.org/browse/libxml2/tree/xinclude.c#n450

If you're not using XML catalogs, this will probably cause the DTD to be 
loaded over the network multiple times which could explain the slowdown.


Can you try to change the line to

xmlCtxtUseOptions(pctxt, ctxt->parseFlags);

and see if it helps?

Nick


Re: [xml] performance of parsing docbook with xincludes

2018-05-14 Thread Stefan Sauer
On 05/14/2018 09:48 PM, Stefan Sauer wrote:
> On 05/14/2018 12:19 PM, Nick Wellnhofer wrote:
>> On 13/05/2018 20:54, Stefan Sauer wrote:
>>> Let's look at some numbers using glib
>>> (https://gitlab.gnome.org/GNOME/glib)
>>>
>>> cd glib/docs/reference/glib
>>> xmllint --timing --xinclude --noout glib-docs.xml
>>> Parsing took 0 ms
>>> Xinclude processing took 4560 ms
>>> Freeing took 91 ms
>>>
>>> Any idea how I can get more of a breakdown of what's happening in 'Xinclude
>>> processing'?
>>
>> It seems that "XInclude processing" also contains the time needed to
>> parse the included documents, so maybe the XIncludes aren't the issue
>> at all (glib-docs.xml is a small document including several larger
>> ones). Can you save glib-docs.xml after processing XIncludes and
>> check whether parsing the consolidated document is considerably faster?
>>
>>> Running with "perf record -g -- xmllint --timing --xinclude --noout
>>> glib-docs.xml" gets me such a report.
>>>
>>> +   17.15%    16.69%  xmllint  libc-2.24.so      [.] _int_malloc
>>> +   11.93%    11.87%  xmllint  libc-2.24.so      [.] malloc_consolidate
>>> +    9.01%     8.97%  xmllint  libxml2.so.2.9.4  [.] xmlDictLookup
>>> +    7.15%     0.00%  xmllint  ld-2.24.so        [.] 0x8021a0022010
>>> +    6.25%     6.21%  xmllint  libxml2.so.2.9.4  [.] xmlHashAddEntry3
>>> +    6.22%     0.00%  xmllint  libxml2.so.2.9.4  [.] xmlSAX2IsStandalone
>>> +    6.22%     0.00%  xmllint  [unknown]         [.] 0x56413c74c0854810
>>> +    3.95%     3.94%  xmllint  libxml2.so.2.9.4  [.] xmlHashLookup2
>>>      3.72%     3.70%  xmllint  libc-2.24.so      [.] _int_free
>>> +    3.28%     0.00%  xmllint  [unknown]         [.]
>>> +    3.06%     3.04%  xmllint  libxml2.so.2.9.4  [.] xmlFreeDocElementContent
>>> +    2.96%     2.91%  xmllint  libc-2.24.so      [.] free
>>
>> The callgraph based reports (perf report -g or -G) are usually more
>> helpful.
>
> This part looks suspicious:
> |--22.98%--0xc2160
> |          xmlFreeDoc
> |          |
> |           --22.42%--xmlFreeDtd
> |                     |
> |                     |--19.62%--xmlHashFree
> |                     |          |
> |                     |          |--10.03%--_int_free
> |                     |          |          |
> |                     |          |           --9.56%--malloc_consolidate
> |                     |          |
> |                     |          |--3.69%--0x7e560
> |                     |          |          xmlFreeDocElementContent
> |                     |          |          |
> |                     |          |           --2.19%--xmlFreeDocElementContent
> |                     |          |
> |                     |          |--0.71%--0x7face
> |                     |          |
> |                     |          |--0.66%--0x30498
> |                     |          |
> |                     |           --0.61%--0x7fae3
> |                     |                     xmlUnlinkNode
> |                     |
> |                      --0.89%--xmlFreeNode
>
>
> Can I tell it to not load DTDs in the first place? Is it loading the DTD for
> each and every xinclude?
>
> Stefan
All my xincluded files have doctype headers like:

<!DOCTYPE ... "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
]>

If I remove them it seems to become faster. I'll see if I can
programmatically strip them all to be really sure though.
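
A minimal sketch of how the stripping could be done in C, assuming each file
fits in memory and has a single DOCTYPE declaration before the root element.
The name strip_doctype is made up for illustration and is not part of libxml2.
The scan skips quoted literals, comments and the internal subset so that a '>'
inside any of those does not end the declaration early. It is only meant for
the timing experiment: dropping the DOCTYPE also drops any entities the DTD or
internal subset declares, so files that use them will no longer parse.

```c
#include <assert.h>
#include <string.h>

/*
 * Remove the first <!DOCTYPE ...> declaration from a NUL-terminated buffer,
 * in place. Returns 1 if a declaration was removed, 0 otherwise (the buffer
 * is left untouched in that case).
 */
int strip_doctype(char *buf)
{
    assert(buf != NULL);

    char *start = strstr(buf, "<!DOCTYPE");
    if (start == NULL)
        return 0;

    int in_subset = 0;   /* inside the [...] internal subset */
    char quote = 0;      /* active quote character, or 0 when outside quotes */
    char *p = start + strlen("<!DOCTYPE");

    for (; *p != '\0'; p++) {
        if (quote) {
            if (*p == quote)
                quote = 0;
        } else if (strncmp(p, "<!--", 4) == 0) {
            char *end = strstr(p + 4, "-->");
            if (end == NULL)
                return 0;        /* unterminated comment: give up */
            p = end + 2;         /* the loop increment steps past the '>' */
        } else if (*p == '"' || *p == '\'') {
            quote = *p;
        } else if (*p == '[') {
            in_subset = 1;
        } else if (*p == ']') {
            in_subset = 0;
        } else if (*p == '>' && !in_subset) {
            /* Shift the tail (including the NUL) over the declaration. */
            memmove(start, p + 1, strlen(p + 1) + 1);
            return 1;
        }
    }
    return 0;                    /* unterminated declaration: give up */
}
```

To use it, read a file into a writable buffer, call strip_doctype() on it and
write the buffer back out. Working on copies of the docs keeps the originals
intact for comparison.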

Stefan

>
>>
>>> Any ideas? Are there known issues with using xincludes here?
>>
>> It might be quadratic behavior in the XInclude engine or something
>> else entirely. How large is glib-docs.xml after processing XIncludes?
>>
>> Nick
>


Re: [xml] performance of parsing docbook with xincludes

2018-05-14 Thread Stefan Sauer
On 05/14/2018 12:19 PM, Nick Wellnhofer wrote:
> On 13/05/2018 20:54, Stefan Sauer wrote:
>> Let's look at some numbers using glib
>> (https://gitlab.gnome.org/GNOME/glib)
>>
>> cd glib/docs/reference/glib
>> xmllint --timing --xinclude --noout glib-docs.xml
>> Parsing took 0 ms
>> Xinclude processing took 4560 ms
>> Freeing took 91 ms
>>
>> Any idea how I can get more breakdown of what's happening in 'Xinclude
>> processing'?
>
> It seems that "XInclude processing" also contains the time needed to
> parse the included documents, so maybe the XIncludes aren't the issue
> at all (glib-docs.xml is a small document including several larger
> ones). Can you save glib-docs.xml after processing XIncludes and check
> whether parsing the consolidated document is considerably faster?
>
>> Running with "perf record -g -- xmllint --timing --xinclude --noout
>> glib-docs.xml" gets me such a report.
>>
>> +   17.15%    16.69%  xmllint  libc-2.24.so      [.] _int_malloc
>> +   11.93%    11.87%  xmllint  libc-2.24.so      [.] malloc_consolidate
>> +    9.01%     8.97%  xmllint  libxml2.so.2.9.4  [.] xmlDictLookup
>> +    7.15%     0.00%  xmllint  ld-2.24.so        [.] 0x8021a0022010
>> +    6.25%     6.21%  xmllint  libxml2.so.2.9.4  [.] xmlHashAddEntry3
>> +    6.22%     0.00%  xmllint  libxml2.so.2.9.4  [.] xmlSAX2IsStandalone
>> +    6.22%     0.00%  xmllint  [unknown]         [.] 0x56413c74c0854810
>> +    3.95%     3.94%  xmllint  libxml2.so.2.9.4  [.] xmlHashLookup2
>>      3.72%     3.70%  xmllint  libc-2.24.so      [.] _int_free
>> +    3.28%     0.00%  xmllint  [unknown]         [.]
>> +    3.06%     3.04%  xmllint  libxml2.so.2.9.4  [.] xmlFreeDocElementContent
>> +    2.96%     2.91%  xmllint  libc-2.24.so      [.] free
>
> The callgraph based reports (perf report -g or -G) are usually more
> helpful.

This part looks suspicious:

   |--22.98%--0xc2160
   |          xmlFreeDoc
   |          |
   |           --22.42%--xmlFreeDtd
   |                     |
   |                     |--19.62%--xmlHashFree
   |                     |          |
   |                     |          |--10.03%--_int_free
   |                     |          |          |
   |                     |          |           --9.56%--malloc_consolidate
   |                     |          |
   |                     |          |--3.69%--0x7e560
   |                     |          |          xmlFreeDocElementContent
   |                     |          |          |
   |                     |          |           --2.19%--xmlFreeDocElementContent
   |                     |          |
   |                     |          |--0.71%--0x7face
   |                     |          |
   |                     |          |--0.66%--0x30498
   |                     |          |
   |                     |           --0.61%--0x7fae3
   |                     |                     xmlUnlinkNode
   |                     |
   |                      --0.89%--xmlFreeNode


Can I tell it to not load DTDs in the first place? Is it loading the DTD for
each and every xinclude?

Stefan

>
>> Any ideas? Are there known issues with using xincludes here?
>
> It might be quadratic behavior in the XInclude engine or something
> else entirely. How large is glib-docs.xml after processing XIncludes?
>
> Nick




Re: [xml] performance of parsing docbook with xincludes

2018-05-14 Thread Stefan Sauer
Seems that the list does not like attachments, so resending with links.

On 05/14/2018 12:19 PM, Nick Wellnhofer wrote:
> On 13/05/2018 20:54, Stefan Sauer wrote:
>> Let's look at some numbers using glib
>> (https://gitlab.gnome.org/GNOME/glib)
>>
>> cd glib/docs/reference/glib
>> xmllint --timing --xinclude --noout glib-docs.xml
>> Parsing took 0 ms
>> Xinclude processing took 4560 ms
>> Freeing took 91 ms
>>
>> Any idea how I can get more breakdown of what's happening in 'Xinclude
>> processing'?
>
> It seems that "XInclude processing" also contains the time needed to
> parse the included documents, so maybe the XIncludes aren't the issue
> at all (glib-docs.xml is a small document including several larger
> ones). Can you save glib-docs.xml after processing XIncludes and check
> whether parsing the consolidated document is considerably faster?

$ xmllint --timing --xinclude glib-docs.xml >glib-full.xml

Parsing took 0 ms
Xinclude processing took 4726 ms
Saving took 85 ms
Freeing took 104 ms

$ xmllint --timing --noout glib-full.xml

Parsing took 151 ms
Freeing took 14 ms

$ ls -alh glib-full.xml

-rw-r--r-- 1 ensonic users 6.8M May 14 12:42 glib-full.xml

Parsing the consolidated doc is an order of magnitude faster. Thanks for
suggesting this test.

$ xtime sh -c 'find . -name "*.xml" | grep -v version.xml | xargs xmllint --noout'
0.40u 0.04s 0.44r 70296kB sh -c find . -name "*.xml" | grep -v version.xml | xargs xmllint --noout

Even parsing all files like this is >10 times faster.

>
>> Running with "perf record -g -- xmllint --timing --xinclude --noout
>> glib-docs.xml" gets me such a report.
>>
>> +   17.15%    16.69%  xmllint  libc-2.24.so      [.] _int_malloc
>> +   11.93%    11.87%  xmllint  libc-2.24.so      [.] malloc_consolidate
>> +    9.01%     8.97%  xmllint  libxml2.so.2.9.4  [.] xmlDictLookup
>> +    7.15%     0.00%  xmllint  ld-2.24.so        [.] 0x8021a0022010
>> +    6.25%     6.21%  xmllint  libxml2.so.2.9.4  [.] xmlHashAddEntry3
>> +    6.22%     0.00%  xmllint  libxml2.so.2.9.4  [.] xmlSAX2IsStandalone
>> +    6.22%     0.00%  xmllint  [unknown]         [.] 0x56413c74c0854810
>> +    3.95%     3.94%  xmllint  libxml2.so.2.9.4  [.] xmlHashLookup2
>>      3.72%     3.70%  xmllint  libc-2.24.so      [.] _int_free
>> +    3.28%     0.00%  xmllint  [unknown]         [.]
>> +    3.06%     3.04%  xmllint  libxml2.so.2.9.4  [.] xmlFreeDocElementContent
>> +    2.96%     2.91%  xmllint  libc-2.24.so      [.] free
>
> The callgraph based reports (perf report -g or -G) are usually more
> helpful.
Here is a full perf.log with callgraph:
https://gist.github.com/ensonic/a73608edd60a995374e0961d3840eb4f
>
>> Any ideas? Are there known issues with using xincludes here?
>
> It might be quadratic behavior in the XInclude engine or something
> else entirely. How large is glib-docs.xml after processing XIncludes?

See above. Btw. only the toplevel doc is using xincludes, the included
docs don't include other docs:

$grep "xi:inc" glib-docs.xml | wc -l

116
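
As a rough back-of-the-envelope check, the timings above can be turned into a
per-include cost. This is only a sketch: it assumes the flat parse time of the
consolidated document approximates the unavoidable parsing work and spreads
the rest evenly over the includes. The function names are made up for
illustration.

```c
/* How many times slower the XInclude run is than parsing the flat file. */
double slowdown(double xinclude_ms, double flat_parse_ms)
{
    return xinclude_ms / flat_parse_ms;
}

/* Average extra time per include, in milliseconds (0 if there are none). */
double per_include_overhead_ms(double xinclude_ms, double flat_parse_ms,
                               int includes)
{
    if (includes <= 0)
        return 0.0;
    return (xinclude_ms - flat_parse_ms) / includes;
}
```

With the numbers from this mail (4726 ms, 151 ms, 116 includes), slowdown()
comes out at about 31 and per_include_overhead_ms() at about 39 ms per
included file.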



>
> Nick




Re: [xml] performance of parsing docbook with xincludes

2018-05-14 Thread Nick Wellnhofer

On 13/05/2018 20:54, Stefan Sauer wrote:
> Let's look at some numbers using glib (https://gitlab.gnome.org/GNOME/glib)
>
> cd glib/docs/reference/glib
> xmllint --timing --xinclude --noout glib-docs.xml
> Parsing took 0 ms
> Xinclude processing took 4560 ms
> Freeing took 91 ms
>
> Any idea how I can get more breakdown of what's happening in 'Xinclude
> processing'?


It seems that "XInclude processing" also contains the time needed to parse the 
included documents, so maybe the XIncludes aren't the issue at all 
(glib-docs.xml is a small document including several larger ones). Can you 
save glib-docs.xml after processing XIncludes and check whether parsing the 
consolidated document is considerably faster?



> Running with "perf record -g -- xmllint --timing --xinclude --noout
> glib-docs.xml" gets me such a report.
>
> +   17.15%    16.69%  xmllint  libc-2.24.so      [.] _int_malloc
> +   11.93%    11.87%  xmllint  libc-2.24.so      [.] malloc_consolidate
> +    9.01%     8.97%  xmllint  libxml2.so.2.9.4  [.] xmlDictLookup
> +    7.15%     0.00%  xmllint  ld-2.24.so        [.] 0x8021a0022010
> +    6.25%     6.21%  xmllint  libxml2.so.2.9.4  [.] xmlHashAddEntry3
> +    6.22%     0.00%  xmllint  libxml2.so.2.9.4  [.] xmlSAX2IsStandalone
> +    6.22%     0.00%  xmllint  [unknown]         [.] 0x56413c74c0854810
> +    3.95%     3.94%  xmllint  libxml2.so.2.9.4  [.] xmlHashLookup2
>      3.72%     3.70%  xmllint  libc-2.24.so      [.] _int_free
> +    3.28%     0.00%  xmllint  [unknown]         [.]
> +    3.06%     3.04%  xmllint  libxml2.so.2.9.4  [.] xmlFreeDocElementContent
> +    2.96%     2.91%  xmllint  libc-2.24.so      [.] free


The callgraph based reports (perf report -g or -G) are usually more helpful.


> Any ideas? Are there known issues with using xincludes here?


It might be quadratic behavior in the XInclude engine or something else 
entirely. How large is glib-docs.xml after processing XIncludes?


Nick