Re: [xml] performance of parsing docbook with xincludes

2018-05-15 Thread Nick Wellnhofer
On May 15, 2018, at 21:56, Stefan Sauer wrote:
> 
> On 05/15/2018 08:40 PM, Stefan Sauer wrote:
>> On 05/15/2018 12:42 PM, Nick Wellnhofer wrote:
>>> Can you try to change the line to
>>> 
>>> xmlCtxtUseOptions(pctxt, ctxt->parseFlags);
>>> 
>>> and see if it helps?
>>> 
>> It does not help. I'll experiment further. Thanks for the recommendations.

I think you also have to remove the line at 
https://git.gnome.org/browse/libxml2/tree/xinclude.c#n463

pctxt->loadsubset |= XML_DETECT_IDS;

Looks like the idea is to make sure that ID attributes are detected for 
XIncludes with XPointers. IMO, it should be the application's responsibility to 
set the XML_PARSE_DTDLOAD flag in this case. But changing the behavior might 
break code that relies on this feature.
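
A minimal sketch of why the first change alone may not have been enough. It assumes, from a reading of the libxml2 sources of this period and not confirmed in this thread, that xmlCtxtUseOptions sets loadsubset to XML_DETECT_IDS only when XML_PARSE_DTDLOAD is requested, and that the external subset is fetched whenever loadsubset is non-zero. This is a stand-alone model, not libxml2 code: every sketch_* name is hypothetical, and the two constants restate values from libxml2's parser.h so that it compiles without the library. Validation, which can also trigger DTD loading, is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Restated from libxml2's parser.h so the model builds without the library. */
enum {
    SKETCH_PARSE_DTDLOAD = 1 << 2, /* XML_PARSE_DTDLOAD */
    SKETCH_DETECT_IDS    = 2       /* XML_DETECT_IDS */
};

/* Hypothetical stand-in for the one parser-context field that matters here. */
typedef struct {
    int loadsubset;
} SketchParserCtxt;

/* Assumed behaviour of xmlCtxtUseOptions, reduced to the DTD-loading flag. */
static void sketch_use_options(SketchParserCtxt *pctxt, int options)
{
    assert(pctxt != NULL);
    pctxt->loadsubset = (options & SKETCH_PARSE_DTDLOAD) ? SKETCH_DETECT_IDS : 0;
}

/*
 * Models how xinclude.c prepares the parser for one included document.
 * keep_forced_ids selects whether "pctxt->loadsubset |= XML_DETECT_IDS;"
 * is still present. Returns true if the external DTD would be loaded.
 */
static bool sketch_include_loads_dtd(int parse_flags, bool keep_forced_ids)
{
    SketchParserCtxt pctxt = { 0 };

    sketch_use_options(&pctxt, parse_flags);
    if (keep_forced_ids)
        pctxt.loadsubset |= SKETCH_DETECT_IDS;

    /* Assumption: a non-zero loadsubset makes the parser fetch the DTD. */
    return pctxt.loadsubset != 0;
}
```

Under these assumptions, sketch_include_loads_dtd(0, true) is true while sketch_include_loads_dtd(0, false) is false: with no DTD flag from the application, the forced line alone keeps the DTD load alive. Passing SKETCH_PARSE_DTDLOAD gives true either way, which corresponds to the application opting in explicitly.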

> Is libxml2 doing that for each file over and over?

Yes.

> Wouldn't it make sense to only load each dtd once?

This would make sense.
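
One way to picture "load each DTD once" is a small cache keyed by the DTD's system identifier, consulted before any load. The sketch below is stand-alone C and not part of libxml2: DtdCache, dtd_cache_get, the DtdLoader callback, and the counting stub_loader are all hypothetical names. It also ignores what a real implementation would have to handle, such as ownership and freeing of the cached DTD, internal subsets that differ per file, and thread safety.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DTD_CACHE_MAX 8

/* Hypothetical loader callback: parses the DTD named by system_id. */
typedef void *(*DtdLoader)(const char *system_id);

typedef struct {
    const char *system_id; /* borrowed pointer: must outlive the cache */
    void *dtd;             /* opaque handle to the parsed DTD */
} DtdCacheEntry;

typedef struct {
    DtdCacheEntry entries[DTD_CACHE_MAX];
    size_t count;
} DtdCache;

/* Return the cached DTD for system_id, calling load only on a miss. */
static void *dtd_cache_get(DtdCache *cache, const char *system_id, DtdLoader load)
{
    assert(cache != NULL && system_id != NULL && load != NULL);

    for (size_t i = 0; i < cache->count; i++) {
        if (strcmp(cache->entries[i].system_id, system_id) == 0)
            return cache->entries[i].dtd;
    }

    void *dtd = load(system_id);
    /* Remember successful loads while there is room; otherwise pass through. */
    if (dtd != NULL && cache->count < DTD_CACHE_MAX) {
        cache->entries[cache->count].system_id = system_id;
        cache->entries[cache->count].dtd = dtd;
        cache->count++;
    }
    return dtd;
}

/* Stand-in loader that only counts how often it is asked to load. */
static int stub_load_count = 0;

static void *stub_loader(const char *system_id)
{
    (void)system_id;
    stub_load_count++;
    return &stub_load_count;
}
```

Calling dtd_cache_get twice with the same identifier invokes the loader only once; the second call returns the cached pointer.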

> And where exactly is it loaded (I can only
> see xmlFreeDtd, but can't find an xmlLoadDtd or the like)?

Via xmlParseDocument -> xmlSAX2ExternalSubset -> xmlParseExternalSubset.

Nick

___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml


Re: [xml] performance of parsing docbook with xincludes

2018-05-15 Thread Stefan Sauer
On 05/15/2018 08:40 PM, Stefan Sauer wrote:
> On 05/15/2018 12:42 PM, Nick Wellnhofer wrote:
>> On 14/05/2018 21:48, Stefan Sauer wrote:
>>> This part looks suspicious:
>>>
>>>     |--22.98%--0xc2160
>>>     |  xmlFreeDoc
>>>     |  |
>>>     |   --22.42%--xmlFreeDtd
>>> Can I tell it to not load dtds in the first place? Is it loading the
>>> dtd for each and every xinclude?
>> Good catch. It seems that the XInclude engine always parses included
>> docs with XML_PARSE_DTDLOAD:
>>
>>     https://git.gnome.org/browse/libxml2/tree/xinclude.c#n450
>>
>> If you're not using XML catalogs, this will probably cause the DTD to
>> be loaded over the network multiple times which could explain the
>> slowdown.
>>
>> Can you try to change the line to
>>
>>     xmlCtxtUseOptions(pctxt, ctxt->parseFlags);
>>
>> and see if it helps?
>>
>> Nick
> It does not help. I'll experiment further. Thanks for the recommendations.
And FYI, a call graph plot:
https://imgur.com/a/d27xxor

As an experiment I dropped the doctype headers for the (generated)
xincluded files. So now it is 20 files with doctype headers + 105
(generated) files without doctype headers. And voila!

xmllint --timing --xinclude  --noout glib-docs.xml
Parsing took 0 ms
Xinclude processing took 447 ms
Freeing took 19 ms

The docbook header looks like this:

<!DOCTYPE book [...]
  'http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd' [
  [...] 'http://www.w3.org/2003/XInclude'">
  [...]
]>

and gtk-doc will replicate this for the fragments (replacing 'book' with
e.g. 'refentry'). This way one can e.g. inject things like a version.

I do have the /usr/share/xml/docbook/schema/dtd/4.5/docbookx.dtd locally
available. I guess there is no way to avoid loading the dtd then. Is
libxml2 doing that for each file over and over? Wouldn't it make sense
to only load each dtd once? And where exactly is it loaded (I can only
see xmlFreeDtd, but can't find an xmlLoadDtd or the like)?

Sorry for all the questions, but it looks like there is low hanging
fruit to save a lot of cpu time.

Stefan


Re: [xml] performance of parsing docbook with xincludes

2018-05-15 Thread Stefan Sauer
On 05/15/2018 12:42 PM, Nick Wellnhofer wrote:
> On 14/05/2018 21:48, Stefan Sauer wrote:
>> This part looks suspicious:
>>
>>     |--22.98%--0xc2160
>>     |  xmlFreeDoc
>>     |  |
>>     |   --22.42%--xmlFreeDtd
>
>> Can I tell it to not load dtds in the first place? Is it loading the
>> dtd for each and every xinclude?
>
> Good catch. It seems that the XInclude engine always parses included
> docs with XML_PARSE_DTDLOAD:
>
>     https://git.gnome.org/browse/libxml2/tree/xinclude.c#n450
>
> If you're not using XML catalogs, this will probably cause the DTD to
> be loaded over the network multiple times which could explain the
> slowdown.
>
> Can you try to change the line to
>
>     xmlCtxtUseOptions(pctxt, ctxt->parseFlags);
>
> and see if it helps?
>
> Nick

It does not help. I'll experiment further. Thanks for the recommendations.


Stefan



Re: [xml] performance of parsing docbook with xincludes

2018-05-15 Thread Nick Wellnhofer

On 14/05/2018 21:48, Stefan Sauer wrote:
> This part looks suspicious:
>
>     |--22.98%--0xc2160
>     |  xmlFreeDoc
>     |  |
>     |   --22.42%--xmlFreeDtd
>
> Can I tell it to not load dtds in the first place? Is it loading the
> dtd for each and every xinclude?

Good catch. It seems that the XInclude engine always parses included docs with
XML_PARSE_DTDLOAD:

    https://git.gnome.org/browse/libxml2/tree/xinclude.c#n450

If you're not using XML catalogs, this will probably cause the DTD to be
loaded over the network multiple times, which could explain the slowdown.

Can you try to change the line to

    xmlCtxtUseOptions(pctxt, ctxt->parseFlags);

and see if it helps?

Nick