Re: [xml] performance of parsing docbook with xincludes
On 06/07/2018 01:55 PM, Nick Wellnhofer wrote:
> On 07/06/2018 00:00, Stefan Sauer wrote:
>>> Another idea is to stop loading external DTDs for XIncludes without an
>>> XPointer expression. This would still change the behavior for some
>>> users but it's much less likely to cause problems.
>> change the behaviour, as in we would not catch validation errors?
>
> No, nothing related to validation. If you validate a document, the
> DTDs will always be loaded. But parsing with or without
> XML_PARSE_DTDLOAD will obviously produce different results. It's hard
> to tell whether this will cause problems for users. But maybe I'm
> overly cautious. If someone parses a document without DTD flags, why
> would they assume that XIncluded documents are parsed with
> XML_PARSE_DTDLOAD?

Validation is one thing, but e.g. applying default attributes is another
thing. Basically what I want to avoid is loading the external subset over
and over again, but the internal subset should be applied. I am still
looking where things like

http://www.w3.org/2003/XInclude'">

are applied.

The other problem seems to be that ID refs between the master and the
XIncluded docs are not resolved - is that what XML_DETECT_IDS controls? I
checked the doc comments in the sources, but it is hard to tell. If I don't
comment out

    pctxt->loadsubset |= XML_DETECT_IDS;

I get my links resolved, but the speedup is gone.

>> Too bad that xmlXIncludeParseFile() does not get the parent parserCtx,
>> in that case we could apply the same flags.
>
> I think the original flags are already passed via xmlXIncludeSetFlags.

You are right, traced it back.

>> It seems that xmldict is only handling key and value to be a string,
>> right? So, we'll even need our own cache data structure. I'd say it
>> would need to be on the _xmlXIncludeCtxt level. global is easier, but
>> then we can't free it ever :/
>
> xmlHash should work fine:
>
> http://xmlsoft.org/html/libxml-hash.html
>
> But building a DTD cache would be the least of your problems. The hard
> part is to apply a cached DTD to a document. There are some
> interactions between internal and external subsets (see
> xmlAddElementDecl and xmlAddAttributeDecl in valid.c for example), so
> it looks like you can't just simply set doc->extSubset to the
> cached DTD. You'd probably have to replay the calls to
> xmlAddElementDecl etc, maybe even in the original order which might be
> lost. That's why I wouldn't want to go down this route.

From looking more at the code I agree. I am now checking if I can share the
xmlDict between all the DTDs so that we fix the 25% spent in xmlFree. I
don't want to replace allocators, since I am using it from python via lxml
and I won't be able to patch the allocators.

Thanks for your support on discussing the options.

> Nick

___
xml mailing list, project page http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml
Re: [xml] performance of parsing docbook with xincludes
On 08/06/2018 03:45, Eric S. Eberhard wrote:
> Some very simple things to do:
>
> 1) put the DTD hosts into the /etc/hosts file (or another if you like
>    and substitute an IP)
> 2) set /etc/resolv.conf to first look in the hosts file (before DNS)

The discussion is not about caching DTDs loaded over the network but from
the local file system. In this particular case, the same Docbook DTD
(~250 KB) is parsed more than 100 times, once for each XInclude.

> If I was to suggest a speed up of libxml2 I would change it to allow
> optionally (probably at compile time) to never free memory -- each
> node, piece of data, etc that is created and destroyed constantly
> would just sit there (and slowly grow until it levels out).

libxml2 already allows you to use your own memory allocators. It's easy to
make `free` a no-op.

Nick
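As an illustration of the "make `free` a no-op" remark, here is a minimal sketch of a bump allocator whose free function does nothing. The `arena_*` names, the `ARENA_SIZE` constant and the size header are hypothetical choices for this sketch, not part of libxml2. The four functions are shaped like the callbacks that libxml2's `xmlMemSetup()` accepts (free, malloc, realloc, strdup), but the sketch does not call libxml2 and makes no claim about how much time this would save in practice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Arbitrary arena size for this sketch (1 MiB); a real tool would size it
 * for its workload or grow it in chunks. */
#define ARENA_SIZE ((size_t)1 << 20)

/* Size header stored in front of every block so realloc knows how much
 * old data to copy. The union keeps the payload maximally aligned. */
typedef union {
    size_t size;
    max_align_t align;
} BlockHeader;

static unsigned char *arena = NULL;  /* backing store, allocated lazily */
static size_t arena_used = 0;        /* bytes handed out so far */

/* malloc replacement: carve the next block off the arena. */
void *arena_malloc(size_t size) {
    if (size > ARENA_SIZE)
        return NULL;
    if (arena == NULL && (arena = malloc(ARENA_SIZE)) == NULL)
        return NULL;
    size_t unit = sizeof(BlockHeader);
    size_t payload = (size + unit - 1) / unit * unit;  /* round up */
    if (unit + payload > ARENA_SIZE - arena_used)
        return NULL;  /* arena exhausted */
    BlockHeader *hdr = (BlockHeader *)(arena + arena_used);
    hdr->size = size;
    arena_used += unit + payload;
    return hdr + 1;
}

/* free replacement: intentionally a no-op, memory is never returned. */
void arena_free(void *ptr) {
    (void)ptr;
}

/* realloc replacement: reuse the block if it is big enough, else copy. */
void *arena_realloc(void *ptr, size_t size) {
    if (ptr == NULL)
        return arena_malloc(size);
    BlockHeader *hdr = (BlockHeader *)ptr - 1;
    if (size <= hdr->size)
        return ptr;
    void *fresh = arena_malloc(size);
    if (fresh != NULL)
        memcpy(fresh, ptr, hdr->size);
    return fresh;
}

/* strdup replacement built on arena_malloc. */
char *arena_strdup(const char *s) {
    assert(s != NULL);
    size_t len = strlen(s) + 1;
    char *copy = arena_malloc(len);
    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}
```

In a program that links libxml2, these four functions would be registered once, before any other libxml2 call, via `xmlMemSetup(arena_free, arena_malloc, arena_realloc, arena_strdup)`. The trade-off is that memory use only grows, so this suits short-lived tools such as a one-shot xmllint run rather than long-running processes.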
Re: [xml] performance of parsing docbook with xincludes
On 07/06/2018 00:00, Stefan Sauer wrote:
>> Another idea is to stop loading external DTDs for XIncludes without an
>> XPointer expression. This would still change the behavior for some
>> users but it's much less likely to cause problems.
> change the behaviour, as in we would not catch validation errors?

No, nothing related to validation. If you validate a document, the DTDs
will always be loaded. But parsing with or without XML_PARSE_DTDLOAD will
obviously produce different results. It's hard to tell whether this will
cause problems for users. But maybe I'm overly cautious. If someone parses
a document without DTD flags, why would they assume that XIncluded
documents are parsed with XML_PARSE_DTDLOAD?

> Too bad that xmlXIncludeParseFile() does not get the parent parserCtx,
> in that case we could apply the same flags.

I think the original flags are already passed via xmlXIncludeSetFlags.

> It seems that xmldict is only handling key and value to be a string,
> right? So, we'll even need our own cache data structure. I'd say it
> would need to be on the _xmlXIncludeCtxt level. global is easier, but
> then we can't free it ever :/

xmlHash should work fine:

    http://xmlsoft.org/html/libxml-hash.html

But building a DTD cache would be the least of your problems. The hard part
is to apply a cached DTD to a document. There are some interactions between
internal and external subsets (see xmlAddElementDecl and
xmlAddAttributeDecl in valid.c for example), so it looks like you can't
just simply set doc->extSubset to the cached DTD. You'd probably have to
replay the calls to xmlAddElementDecl etc, maybe even in the original order
which might be lost. That's why I wouldn't want to go down this route.

Nick
Re: [xml] performance of parsing docbook with xincludes
On 05/17/2018 04:18 PM, Nick Wellnhofer wrote:
> On 16/05/2018 21:51, Stefan Sauer wrote:
>> So one solution could be another flag to enable this?
>
> Yes, but it would be rather ugly.
>
>> Thanks, reading the code. Need to figure where we could cache external
>> subsets and what a suitable key is (ExternalID ?).
>
> Note that I'm currently not planning to review and integrate larger
> patches from other developers. I only took over some libxml2
> maintenance duties because no one else did. So even if you write a
> high-quality patch, it might never get merged.
>
> Caching external subsets for XIncludes certainly sounds like a nice
> feature but I would prefer to find a simpler solution. For example,
> can't you just omit the external DTD from included documents?

I've tried this and I get some interesting differences. If I modify my
DOCTYPE declarations from e.g.:

http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd; [ http://www.w3.org/2003/XInclude'"> %gtkdocentities; ]>

to

http://www.w3.org/2003/XInclude'"> %gtkdocentities; ]>

and run (for each of the variants)

    xmllint --noent --xinclude tester-docs.xml >tester-docs.nodtd.xml

then I get a lot of delta in this form:

- http://www.w3.org/2003/XInclude; id="api-index-0.1" xml:base="xml/api-index-0.1.xml">
+

Basically if there is no DTD on the doctype, the resulting xi:include nodes
won't have the xmlns:xi attribute.

What is worse, and what puzzles me, is that it causes a small difference in
the html output produced by xsltproc:

-FOO, macro in GtkDocTestIf
+FOO, macro in GtkDocTestIf

If I drop the DTD, the first link misses the 'class' and 'title' attributes
and the 2nd link is not linked at all.

Stefan

> You wrote:
>
>> and gtk-doc will replicate this for the fragments (replacing 'book' with
>> e.g. 'refentry'). This way one can e.g. inject things like a version.
>
> What do you mean by "inject things like a version"? Why exactly do
> your included documents have to reference an external DTD?
>
> Another idea is to stop loading external DTDs for XIncludes without an
> XPointer expression. This would still change the behavior for some
> users but it's much less likely to cause problems.
>
> Nick
Re: [xml] performance of parsing docbook with xincludes
On 06/07/2018 12:54 AM, Eric S. Eberhard wrote:
> I know I am the oddball here but -- why use DTDs at all?

I gave reasons above. I am working on a tool. How people use the tool is
not under my control. Maybe we can focus on the opportunity to improve
libxml2 a bit here.

> I supply software to a lot of companies (thousands through
> dealers). Many exchange millions of XML docs per day. I've used this
> since it was libxml. Even have some patches in there. My application
> is proprietary (meaning XML to get an order or tell a customer our
> availability is simply XML I designed and documented and give to my
> customer's customers (via download from a Web page)). Once they get
> it working it pretty much always works. They write software to create
> orders and send them to us -- it is consistent (I know, not everyone
> has this luxury so this may not apply to everyone). So why check them?
>
> I also found that I was getting a gagillion support tickets because
> DTDs ... simple things like a date ... seem to escape people -- take
> June 7, 2018
>
> In our date fields we will take:
>     Jun 7 2018
>     June 7 2018
>     the above with commas and any case (upper/lower/mixed)
>     6/7/18
>     6/7/2018
>     2018/6/7
>     20180607
>     180606
>     06-07-18
>
> And actually many many more. Anything that is a date goes through
> this one routine and if there is any way in the world to extract a
> date, we do.
>
> Ditto money -- say $1,245.56
>
> We accept:
>     $1,245.56
>     1245.56
>     124556 (decimal is implied at 2 places if no decimal is found)
>     1,235.56
>
> And many more - same thing, one routine reads it and if we can
> possibly get a reasonable number, we do.
>
> This, in turn, reduced our CONSTANT support tickets for silly things
> like a format of something to ZERO. Which I like.
>
> Even sicker -- we ignore case on tags. All of our XML is designed to
> not use duplicate names with different cases (stupid thing to do
> anyway -- expect orderNumber and OrderNumber to both be used, as
> different things).
>
> As long as the customer is consistent and the XML is well formed we
> scan the tree and compare tags without regard to case. A WHOLE LOT
> more support tickets gone.
>
> A lot of the people we deal with are not sophisticated. As the
> receiver of XML we decided it was much better to be as flexible as
> possible and take what we can if at all possible. After all -- a DTD
> can indeed tell you if an address comes in without a city name. And
> reject it and usually generate a support ticket. Since we use an
> on-line AVS system (more XML) and if we have the zip and the address
> otherwise matches ... we don't need the city and state ... the AVS
> system provides it. And if it fails they will get an error back from
> us (from the application) anyway. So why use a DTD to see if the city
> or state were sent? A LOT MORE support calls removed.
>
> And, of course, performance without the DTDs is much better.
>
> As a result we are able to give documentation to new customers and
> they are able to get it up and running with little to no help. Any
> serious errors we cannot fix are clearly explained in the responses BY
> THE APPLICATION and not by a DTD.
>
> Being flexible on our end reduces support tickets which is all I
> care. I would rather code for all the mistakes I can think of an
> enduser would make (and we add new ones when they crop up) than be
> strict and do a lot of support. We don't think DTDs are flexible
> enough. And I hate making them :-)
>
> We do offer a page with DTDs they can use manually to check their
> document if they like -- or they can send it to our test system. Once
> they are running they seem to do just fine.
>
> As programmers it is hard to believe but sometimes it is better for us
> to make slightly less efficient code in order to make the human aspect
> much more efficient. I once had someone send me a link to a "contest"
> which was a convoluted C statement and asking to solve what the result
> would be. My response -- "fire the programmer!"
>
> If it takes 100s of competent C programmers to get the right answer
> (and only a small percent did) to read a line of code -- it is bad
> code. And for people's information, modern computers read ahead and
> pre-execute code based on all kinds of weird logic. Simple C code is
> easy for it to handle ... but convoluted code ends up stopping the
> pre-execution and is actually slower -- may have less lines of code --
> but it will be slower. I see nothing wrong with short clear clean
> code with as little craziness as possible. This is the same with XML
> -- one can go overboard easily, K.I.S.S. :-)
>
> Not being so strict and no DTDs has had other benefits -- say EDI
> (from old IBMs) -- we have a cheap program that maps EDI to XML and
> back. So we can handle EDI -- and we don't need new software (after
> the conversion). We accept the EDI, convert to XML, run our
Re: [xml] performance of parsing docbook with xincludes
On 05/17/2018 06:01 PM, Stefan Sauer wrote:
> On 05/17/2018 04:18 PM, Nick Wellnhofer wrote:
>> On 16/05/2018 21:51, Stefan Sauer wrote:
>>> So one solution could be another flag to enable this?
>> Yes, but it would be rather ugly.
> In which sense? I guess because it is something that no one should need
> to know about or have to care about?
>>> Thanks, reading the code. Need to figure where we could cache external
>>> subsets and what a suitable key is (ExternalID ?).
>> Note that I'm currently not planning to review and integrate larger
>> patches from other developers. I only took over some libxml2
>> maintenance duties because no one else did. So even if you write a
>> high-quality patch, it might never get merged.
> Thanks for making this clear upfront. This is how I ended up becoming
> the gtkdoc maintainer :)
>
>> Caching external subsets for XIncludes certainly sounds like a nice
>> feature but I would prefer to find a simpler solution. For example,
>> can't you just omit the external DTD from included documents?
> Yeah, right now, the benefit of having the DTD is that one can validate
> fragments. I'll do some research (aka grepping over existing projects)
> to see what the doc-type headers in use today look like. If all that
> people do is use an entity to inject the version, I'll write a
> migration tool.
>
> We have a test that validates the doc, but I think I can change this to
> just resolve all xincludes and check through the top-level doctype.

Just to add to this, I am assuming a lot of people follow this book
http://www.sagehill.net/docbookxsl/ModularDoc.html#UsingXinclude
and using a DOCTYPE is part of the examples.

>> You wrote:
>>
>>> and gtk-doc will replicate this for the fragments (replacing 'book' with
>>> e.g. 'refentry'). This way one can e.g. inject things like a version.
>> What do you mean by "inject things like a version"? Why exactly do
>> your included documents have to reference an external DTD?
> The documentation consists of a handwritten master doc (type book), that
> includes more handwritten parts (e.g. tutorials, guides) and includes
> generated reference docs. When gtkdoc generates the reference docs, it
> takes the doctype header of the master doc as a template and
> uses that for the generated reference docs. If the master doc has
> entities declared, those can be expanded in the reference fragments.
> That's the part where I will check how widely it is actually used.
>
> Stefan
>
>> Another idea is to stop loading external DTDs for XIncludes without an
>> XPointer expression. This would still change the behavior for some
>> users but it's much less likely to cause problems.

change the behaviour, as in we would not catch validation errors?

Too bad that xmlXIncludeParseFile() does not get the parent parserCtx, in
that case we could apply the same flags.

>>
>> Nick
> I definitely don't know enough about the implications here. I was mostly
> thinking to see if we can stick a dictionary of xmlDtdPtr into the
> Parser Context and before actually loading a dtd, check if we did
> already and reuse. Somehow the dict needs to be stored in the top-level
> doc, when parsing is done (do we need the dtds once the doc has been
> parsed?). We only free the dtds with the top-level doc. But I agree, it
> is not going to be a two liner.

It seems that xmldict is only handling key and value to be a string, right?
So, we'll even need our own cache data structure. I'd say it would need to
be on the _xmlXIncludeCtxt level. global is easier, but then we can't free
it ever :/

Stefan
Re: [xml] performance of parsing docbook with xincludes
On 05/17/2018 04:18 PM, Nick Wellnhofer wrote:
> On 16/05/2018 21:51, Stefan Sauer wrote:
>> So one solution could be another flag to enable this?
>
> Yes, but it would be rather ugly.

In which sense? I guess because it is something that no one should need to
know about or have to care about?

>> Thanks, reading the code. Need to figure where we could cache external
>> subsets and what a suitable key is (ExternalID ?).
>
> Note that I'm currently not planning to review and integrate larger
> patches from other developers. I only took over some libxml2
> maintenance duties because no one else did. So even if you write a
> high-quality patch, it might never get merged.

Thanks for making this clear upfront. This is how I ended up becoming the
gtkdoc maintainer :)

> Caching external subsets for XIncludes certainly sounds like a nice
> feature but I would prefer to find a simpler solution. For example,
> can't you just omit the external DTD from included documents?

Yeah, right now, the benefit of having the DTD is that one can validate
fragments. I'll do some research (aka grepping over existing projects) to
see what the doc-type headers in use today look like. If all that people do
is use an entity to inject the version, I'll write a migration tool.

We have a test that validates the doc, but I think I can change this to
just resolve all xincludes and check through the top-level doctype.

> You wrote:
>
>> and gtk-doc will replicate this for the fragments (replacing 'book' with
>> e.g. 'refentry'). This way one can e.g. inject things like a version.
>
> What do you mean by "inject things like a version"? Why exactly do
> your included documents have to reference an external DTD?

The documentation consists of a handwritten master doc (type book), that
includes more handwritten parts (e.g. tutorials, guides) and includes
generated reference docs. When gtkdoc generates the reference docs, it
takes the doctype header of the master doc as a template and uses that for
the generated reference docs. If the master doc has entities declared,
those can be expanded in the reference fragments. That's the part where I
will check how widely it is actually used.

Stefan

> Another idea is to stop loading external DTDs for XIncludes without an
> XPointer expression. This would still change the behavior for some
> users but it's much less likely to cause problems.
>
> Nick

I definitely don't know enough about the implications here. I was mostly
thinking to see if we can stick a dictionary of xmlDtdPtr into the Parser
Context and before actually loading a dtd, check if we did already and
reuse. Somehow the dict needs to be stored in the top-level doc, when
parsing is done (do we need the dtds once the doc has been parsed?). We
only free the dtds with the top-level doc. But I agree, it is not going to
be a two liner.

Stefan
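The "check if we did already and reuse" idea from this message can be sketched as a small load-once cache. Everything here is a toy: `ToyDtd`, `ToyDtdCache`, `toy_load_dtd` and the other `toy_*` names are hypothetical stand-ins, not libxml2 types or functions, and a linked list is used instead of a hash table purely for brevity. The sketch covers only the lookup-before-load step and owner-scoped cleanup; it does not attempt to apply a cached DTD to a document.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a parsed DTD; NOT libxml2's xmlDtd. */
typedef struct {
    char *system_id;  /* key: where the DTD was loaded from */
} ToyDtd;

/* One cache entry in a singly linked list. */
typedef struct CacheEntry {
    ToyDtd *dtd;
    struct CacheEntry *next;
} CacheEntry;

/* The cache is owned by one top-level processing context so that it can
 * be freed together with that context (unlike a global cache). */
typedef struct {
    CacheEntry *head;
    int loads;  /* how many times the expensive load actually ran */
} ToyDtdCache;

/* Pretend to parse a DTD (the expensive step in the real scenario). */
static ToyDtd *toy_load_dtd(const char *system_id) {
    ToyDtd *dtd = malloc(sizeof *dtd);
    if (dtd == NULL)
        return NULL;
    dtd->system_id = malloc(strlen(system_id) + 1);
    if (dtd->system_id == NULL) {
        free(dtd);
        return NULL;
    }
    strcpy(dtd->system_id, system_id);
    return dtd;
}

/* Return the DTD for system_id, loading it only on the first request. */
ToyDtd *toy_cache_get(ToyDtdCache *cache, const char *system_id) {
    assert(cache != NULL && system_id != NULL);
    for (CacheEntry *e = cache->head; e != NULL; e = e->next)
        if (strcmp(e->dtd->system_id, system_id) == 0)
            return e->dtd;  /* cache hit: no reload */

    ToyDtd *dtd = toy_load_dtd(system_id);
    CacheEntry *entry = malloc(sizeof *entry);
    if (dtd == NULL || entry == NULL) {
        if (dtd != NULL) {
            free(dtd->system_id);
            free(dtd);
        }
        free(entry);
        return NULL;
    }
    entry->dtd = dtd;
    entry->next = cache->head;
    cache->head = entry;
    cache->loads++;
    return dtd;
}

/* Free all cached DTDs when the owning context goes away. */
void toy_cache_free(ToyDtdCache *cache) {
    assert(cache != NULL);
    CacheEntry *e = cache->head;
    while (e != NULL) {
        CacheEntry *next = e->next;
        free(e->dtd->system_id);
        free(e->dtd);
        free(e);
        e = next;
    }
    cache->head = NULL;
}
```

Requesting the same system identifier twice from one `ToyDtdCache` returns the same pointer and leaves `loads` at 1, which is the behaviour wanted for 100+ included fragments that share one DocBook DTD.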
Re: [xml] performance of parsing docbook with xincludes
On 16/05/2018 21:51, Stefan Sauer wrote:
> So one solution could be another flag to enable this?

Yes, but it would be rather ugly.

> Thanks, reading the code. Need to figure where we could cache external
> subsets and what a suitable key is (ExternalID ?).

Note that I'm currently not planning to review and integrate larger patches
from other developers. I only took over some libxml2 maintenance duties
because no one else did. So even if you write a high-quality patch, it
might never get merged.

Caching external subsets for XIncludes certainly sounds like a nice feature
but I would prefer to find a simpler solution. For example, can't you just
omit the external DTD from included documents? You wrote:

> and gtk-doc will replicate this for the fragments (replacing 'book' with
> e.g. 'refentry'). This way one can e.g. inject things like a version.

What do you mean by "inject things like a version"? Why exactly do your
included documents have to reference an external DTD?

Another idea is to stop loading external DTDs for XIncludes without an
XPointer expression. This would still change the behavior for some users
but it's much less likely to cause problems.

Nick
Re: [xml] performance of parsing docbook with xincludes
On 05/17/2018 01:40 AM, Eric S. Eberhard wrote:
> So again, the first time, put it on your machine and use an IP (or
> localhost if you can which goes straight to the TCP/IP stack -- never
> goes to the network). Also, you can do one other tacky thing I do ...
> sometimes when I see things like that and I don't need it, I just
> change libxml2 -- and put comments so I can update. I have 2-3 of
> those. libxml2 tries to satisfy the entire world ... you only need to
> make you happy :-) You might also consider a static link if doing
> that -- safer if customer loads a different version, and it keeps
> loading itself from changing anything, and the program loads faster, etc.

Sure, that's where we are now. I am looking for this change to help other
developers, so I'll need to find a solution that can be merged into
libxml2. But thanks for the input.

Stefan

> Eric
>
> On 5/15/2018 3:42 AM, Nick Wellnhofer wrote:
>> On 14/05/2018 21:48, Stefan Sauer wrote:
>>> This part looks suspicious:
>>>
>>> |--22.98%--0xc2160
>>> |          xmlFreeDoc
>>> |          |
>>> |           --22.42%--xmlFreeDtd
>>
>>> Can I tell it to not load dtds in the first place? Is it loading the
>>> dtd for each and every xinclude?
>>
>> Good catch. It seems that the XInclude engine always parses included
>> docs with XML_PARSE_DTDLOAD:
>>
>> https://git.gnome.org/browse/libxml2/tree/xinclude.c#n450
>>
>> If you're not using XML catalogs, this will probably cause the DTD to
>> be loaded over the network multiple times which could explain the
>> slowdown.
>>
>> Can you try to change the line to
>>
>>     xmlCtxtUseOptions(pctxt, ctxt->parseFlags);
>>
>> and see if it helps?
>>
>> Nick
>
> --
> Eric S. Eberhard
Re: [xml] performance of parsing docbook with xincludes
On 05/16/2018 12:41 AM, Nick Wellnhofer wrote:
> On May 15, 2018, at 21:56 , Stefan Sauer wrote:
>> On 05/15/2018 08:40 PM, Stefan Sauer wrote:
>>> On 05/15/2018 12:42 PM, Nick Wellnhofer wrote:
>>>> Can you try to change the line to
>>>>
>>>>     xmlCtxtUseOptions(pctxt, ctxt->parseFlags);
>>>>
>>>> and see if it helps?
>>> It does not help. I'll experiment further. Thanks for the recommendations.
> I think you also have to remove the line at
> https://git.gnome.org/browse/libxml2/tree/xinclude.c#n463
>
>     pctxt->loadsubset |= XML_DETECT_IDS;
>
> Looks like the idea is to make sure that ID attributes are detected for
> XIncludes with XPointers. IMO, it should be the application's responsibility
> to set the XML_PARSE_DTDLOAD flag in this case. But changing the behavior
> might break code that relies on this feature.

This helps!

LD_LIBRARY_PATH=~/debug/lib ~/debug/bin/xmllint --timing --xinclude --nonet --noent --noout glib-docs.xml
Parsing took 0 ms
Xinclude processing took 179 ms
Freeing took 17 ms

So one solution could be another flag to enable this?

>> Is libxml2 doing that for each file over and over?
> Yes.

Actually easy to confirm using --load-trace:
https://gist.github.com/ensonic/e1c4c7f80a0c072d119a649722de1e20

>> Wouldn't it make sense to only load each dtd once?
> This would make sense.
>
>> And where exactly is it loaded (I can only
>> see xmlFreeDtd, but can't find a xmlLoadDtd or the like).
> Via xmlParseDocument -> xmlSAX2ExternalSubset -> xmlParseExternalSubset.

Thanks, reading the code. Need to figure where we could cache external
subsets and what a suitable key is (ExternalID ?).

Stefan

> Nick
Re: [xml] performance of parsing docbook with xincludes
On May 15, 2018, at 21:56 , Stefan Sauer wrote:
>
> On 05/15/2018 08:40 PM, Stefan Sauer wrote:
>> On 05/15/2018 12:42 PM, Nick Wellnhofer wrote:
>>> Can you try to change the line to
>>>
>>>     xmlCtxtUseOptions(pctxt, ctxt->parseFlags);
>>>
>>> and see if it helps?
>>>
>> It does not help. I'll experiment further. Thanks for the recommendations.

I think you also have to remove the line at
https://git.gnome.org/browse/libxml2/tree/xinclude.c#n463

    pctxt->loadsubset |= XML_DETECT_IDS;

Looks like the idea is to make sure that ID attributes are detected for
XIncludes with XPointers. IMO, it should be the application's
responsibility to set the XML_PARSE_DTDLOAD flag in this case. But changing
the behavior might break code that relies on this feature.

> Is libxml2 doing that for each file over and over?

Yes.

> Wouldn't it make sense to only load each dtd once?

This would make sense.

> And where exactly is it loaded (I can only
> see xmlFreeDtd, but can't find a xmlLoadDtd or the like).

Via xmlParseDocument -> xmlSAX2ExternalSubset -> xmlParseExternalSubset.

Nick
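To make the interplay of the two lines easier to follow, here is a toy model of what the thread reports: dropping the forced DTD-load option alone "does not help", and the external DTD is only skipped once the forced `loadsubset` line is removed as well. The `Toy*`/`toy_*` names and the flag values are invented for this sketch and are not libxml2's. The rule "any non-zero `loadsubset` causes the external subset to be loaded" is an assumption inferred from the observations in this thread, not a quote of the real xinclude.c or parser code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits for this sketch; not libxml2's actual values. */
enum {
    TOY_PARSE_NOENT   = 1 << 0,  /* substitute entities */
    TOY_PARSE_DTDLOAD = 1 << 1,  /* application asked to load external DTDs */
    TOY_DETECT_IDS    = 1 << 2   /* collect ID attributes, needs DTD data */
};

/* Minimal stand-in for the state of a parser used for an included doc. */
typedef struct {
    int options;     /* parse options in effect */
    int loadsubset;  /* non-zero: external subset gets loaded (assumed) */
} ToyParserCtxt;

/* Set up the parser for an included document.
 *   force_dtdload    models always OR-ing DTDLOAD into the parent flags
 *   force_detect_ids models the unconditional loadsubset |= DETECT_IDS */
ToyParserCtxt toy_setup_include(int parent_flags, int force_dtdload,
                                int force_detect_ids) {
    ToyParserCtxt ctxt = {0, 0};
    ctxt.options = parent_flags | (force_dtdload ? TOY_PARSE_DTDLOAD : 0);
    if (ctxt.options & TOY_PARSE_DTDLOAD)
        ctxt.loadsubset |= TOY_DETECT_IDS;
    if (force_detect_ids)
        ctxt.loadsubset |= TOY_DETECT_IDS;
    return ctxt;
}

/* Would parsing with this context read the external DTD? */
int toy_loads_external_dtd(const ToyParserCtxt *ctxt) {
    assert(ctxt != NULL);
    return ctxt->loadsubset != 0;
}
```

With parent flags of only `TOY_PARSE_NOENT`, the combinations (1, 1) and (0, 1) both still load the DTD, and only (0, 0) skips it. A parent that sets `TOY_PARSE_DTDLOAD` itself keeps loading the DTD in every combination, which corresponds to the position that requesting DTD loading should be the application's responsibility.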
Re: [xml] performance of parsing docbook with xincludes
On 05/15/2018 08:40 PM, Stefan Sauer wrote:
> On 05/15/2018 12:42 PM, Nick Wellnhofer wrote:
>> On 14/05/2018 21:48, Stefan Sauer wrote:
>>> This part looks suspicious:
>>>
>>> |--22.98%--0xc2160
>>> |          xmlFreeDoc
>>> |          |
>>> |           --22.42%--xmlFreeDtd
>>> Can I tell it to not load dtds in the first place? Is it loading the
>>> dtd for each and every xinclude?
>> Good catch. It seems that the XInclude engine always parses included
>> docs with XML_PARSE_DTDLOAD:
>>
>> https://git.gnome.org/browse/libxml2/tree/xinclude.c#n450
>>
>> If you're not using XML catalogs, this will probably cause the DTD to
>> be loaded over the network multiple times which could explain the
>> slowdown.
>>
>> Can you try to change the line to
>>
>>     xmlCtxtUseOptions(pctxt, ctxt->parseFlags);
>>
>> and see if it helps?
>>
>> Nick
> It does not help. I'll experiment further. Thanks for the recommendations.

and FYI: a call graph plot:
https://imgur.com/a/d27xxor

As an experiment I dropped the doctype headers for the (generated)
xincluded files. So now it is 20 files with doctype headers + 105
(generated) files without doctype headers. And voila!

xmllint --timing --xinclude --noout glib-docs.xml
Parsing took 0 ms
Xinclude processing took 447 ms
Freeing took 19 ms

The docbook header looks like this:

http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd' [ http://www.w3.org/2003/XInclude'"> ]>

and gtk-doc will replicate this for the fragments (replacing 'book' with
e.g. 'refentry'). This way one can e.g. inject things like a version.

I do have the /usr/share/xml/docbook/schema/dtd/4.5/docbookx.dtd locally
available. I guess there is no way to avoid loading the dtd then. Is
libxml2 doing that for each file over and over? Wouldn't it make sense to
only load each dtd once? And where exactly is it loaded (I can only see
xmlFreeDtd, but can't find a xmlLoadDtd or the like).

Sorry for all the questions, but it looks like there is low hanging fruit
to save a lot of cpu time.

Stefan
Re: [xml] performance of parsing docbook with xincludes
On 05/15/2018 12:42 PM, Nick Wellnhofer wrote:
> On 14/05/2018 21:48, Stefan Sauer wrote:
>> This part looks suspicious:
>>
>> |--22.98%--0xc2160
>> |          xmlFreeDoc
>> |          |
>> |           --22.42%--xmlFreeDtd
>
>> Can I tell it to not load dtds in the first place? Is it loading the
>> dtd for each and every xinclude?
>
> Good catch. It seems that the XInclude engine always parses included
> docs with XML_PARSE_DTDLOAD:
>
> https://git.gnome.org/browse/libxml2/tree/xinclude.c#n450
>
> If you're not using XML catalogs, this will probably cause the DTD to
> be loaded over the network multiple times which could explain the
> slowdown.
>
> Can you try to change the line to
>
>     xmlCtxtUseOptions(pctxt, ctxt->parseFlags);
>
> and see if it helps?
>
> Nick

It does not help. I'll experiment further. Thanks for the recommendations.

Stefan
Re: [xml] performance of parsing docbook with xincludes
On 14/05/2018 21:48, Stefan Sauer wrote:
> This part looks suspicious:
>
> |--22.98%--0xc2160
> |          xmlFreeDoc
> |          |
> |           --22.42%--xmlFreeDtd

> Can I tell it to not load dtds in the first place? Is it loading the
> dtd for each and every xinclude?

Good catch. It seems that the XInclude engine always parses included docs
with XML_PARSE_DTDLOAD:

    https://git.gnome.org/browse/libxml2/tree/xinclude.c#n450

If you're not using XML catalogs, this will probably cause the DTD to be
loaded over the network multiple times which could explain the slowdown.

Can you try to change the line to

    xmlCtxtUseOptions(pctxt, ctxt->parseFlags);

and see if it helps?

Nick
Re: [xml] performance of parsing docbook with xincludes
On 05/14/2018 09:48 PM, Stefan Sauer wrote:
> On 05/14/2018 12:19 PM, Nick Wellnhofer wrote:
>> On 13/05/2018 20:54, Stefan Sauer wrote:
>>> Let's look at some numbers using glib
>>> (https://gitlab.gnome.org/GNOME/glib)
>>>
>>> cd glib/docs/reference/glib
>>> xmllint --timing --xinclude --noout glib-docs.xml
>>> Parsing took 0 ms
>>> Xinclude processing took 4560 ms
>>> Freeing took 91 ms
>>>
>>> Any idea how I can get more breakdown of what's happening in 'Xinclude
>>> processing'?
>>
>> It seems that "XInclude processing" also contains the time needed to
>> parse the included documents, so maybe the XIncludes aren't the issue
>> at all (glib-docs.xml is a small document including several larger
>> ones). Can you save glib-docs.xml after processing XIncludes and
>> check whether parsing the consolidated document is considerably faster?
>>
>>> Running with "perf record -g -- xmllint --timing --xinclude --noout
>>> glib-docs.xml" gets me such a report.
>>>
>>> +  17.15%  16.69%  xmllint  libc-2.24.so      [.] _int_malloc
>>> +  11.93%  11.87%  xmllint  libc-2.24.so      [.] malloc_consolidate
>>> +   9.01%   8.97%  xmllint  libxml2.so.2.9.4  [.] xmlDictLookup
>>> +   7.15%   0.00%  xmllint  ld-2.24.so        [.] 0x8021a0022010
>>> +   6.25%   6.21%  xmllint  libxml2.so.2.9.4  [.] xmlHashAddEntry3
>>> +   6.22%   0.00%  xmllint  libxml2.so.2.9.4  [.] xmlSAX2IsStandalone
>>> +   6.22%   0.00%  xmllint  [unknown]         [.] 0x56413c74c0854810
>>> +   3.95%   3.94%  xmllint  libxml2.so.2.9.4  [.] xmlHashLookup2
>>>     3.72%   3.70%  xmllint  libc-2.24.so      [.] _int_free
>>> +   3.28%   0.00%  xmllint  [unknown]         [.]
>>> +   3.06%   3.04%  xmllint  libxml2.so.2.9.4  [.] xmlFreeDocElementContent
>>> +   2.96%   2.91%  xmllint  libc-2.24.so      [.] free
>>
>> The callgraph based reports (perf report -g or -G) are usually more
>> helpful.
>
> This part looks suspicious:
>
> |--22.98%--0xc2160
> |          xmlFreeDoc
> |          |
> |           --22.42%--xmlFreeDtd
> |                     |
> |                     |--19.62%--xmlHashFree
> |                     |          |
> |                     |          |--10.03%--_int_free
> |                     |          |          |
> |                     |          |           --9.56%--malloc_consolidate
> |                     |          |
> |                     |          |--3.69%--0x7e560
> |                     |          |          xmlFreeDocElementContent
> |                     |          |          |
> |                     |          |           --2.19%--xmlFreeDocElementContent
> |                     |          |
> |                     |          |--0.71%--0x7face
> |                     |          |
> |                     |          |--0.66%--0x30498
> |                     |          |
> |                     |           --0.61%--0x7fae3
> |                     |                     xmlUnlinkNode
> |                     |
> |                      --0.89%--xmlFreeNode
>
> Can I tell it to not load DTDs in the first place? Is it loading the DTD
> for each and every xinclude?
>
> Stefan

All my xincluded files have doctype headers like:

http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd; [ ]>

If I remove them, it seems to become faster. I'll see if I can
programmatically strip them all to be really sure though.

Stefan

>>> Any ideas? Is there a known issue with using xincludes here?
>>
>> It might be quadratic behavior in the XInclude engine or something
>> else entirely. How large is glib-docs.xml after processing XIncludes?
>>
>> Nick
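The experiment of stripping the doctype headers programmatically could be sketched roughly as follows. This is a hypothetical helper (`strip_doctype` is not part of gtk-doc or libxml2) built on a regular expression, and it assumes the internal subset does not contain the sequence `]>` inside a quoted value. Note that removing the DOCTYPE also removes any entity declarations, so files that reference DTD-defined entities may no longer parse; it is meant for timing experiments only.

```python
import re

# Matches a DOCTYPE declaration with an optional internal subset in [...].
# Regex-based sketch: it does not handle "]>" inside quoted entity values.
DOCTYPE_RE = re.compile(r"<!DOCTYPE\s+[^\[>]*(?:\[.*?\])?\s*>\s*", re.DOTALL)


def strip_doctype(xml_text):
    """Return xml_text with its first DOCTYPE declaration removed."""
    return DOCTYPE_RE.sub("", xml_text, count=1)


if __name__ == "__main__":
    sample = (
        '<?xml version="1.0"?>\n'
        '<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"\n'
        ' "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [\n'
        '<!ENTITY foo "bar">\n'
        "]>\n"
        '<refentry id="x"/>\n'
    )
    print(strip_doctype(sample))
```

Running the stripped copies through `xmllint --timing --xinclude` again would show whether the DTD loading accounts for the difference.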
Re: [xml] performance of parsing docbook with xincludes
On 05/14/2018 12:19 PM, Nick Wellnhofer wrote:
> On 13/05/2018 20:54, Stefan Sauer wrote:
>> Let's look at some numbers using glib
>> (https://gitlab.gnome.org/GNOME/glib)
>>
>> cd glib/docs/reference/glib
>> xmllint --timing --xinclude --noout glib-docs.xml
>> Parsing took 0 ms
>> Xinclude processing took 4560 ms
>> Freeing took 91 ms
>>
>> Any idea how I can get more breakdown of what's happening in 'Xinclude
>> processing'?
>
> It seems that "XInclude processing" also contains the time needed to
> parse the included documents, so maybe the XIncludes aren't the issue
> at all (glib-docs.xml is a small document including several larger
> ones). Can you save glib-docs.xml after processing XIncludes and check
> whether parsing the consolidated document is considerably faster?
>
>> Running with "perf record -g -- xmllint --timing --xinclude --noout
>> glib-docs.xml" gets me such a report.
>>
>> +  17.15%  16.69%  xmllint  libc-2.24.so      [.] _int_malloc
>> +  11.93%  11.87%  xmllint  libc-2.24.so      [.] malloc_consolidate
>> +   9.01%   8.97%  xmllint  libxml2.so.2.9.4  [.] xmlDictLookup
>> +   7.15%   0.00%  xmllint  ld-2.24.so        [.] 0x8021a0022010
>> +   6.25%   6.21%  xmllint  libxml2.so.2.9.4  [.] xmlHashAddEntry3
>> +   6.22%   0.00%  xmllint  libxml2.so.2.9.4  [.] xmlSAX2IsStandalone
>> +   6.22%   0.00%  xmllint  [unknown]         [.] 0x56413c74c0854810
>> +   3.95%   3.94%  xmllint  libxml2.so.2.9.4  [.] xmlHashLookup2
>>     3.72%   3.70%  xmllint  libc-2.24.so      [.] _int_free
>> +   3.28%   0.00%  xmllint  [unknown]         [.]
>> +   3.06%   3.04%  xmllint  libxml2.so.2.9.4  [.] xmlFreeDocElementContent
>> +   2.96%   2.91%  xmllint  libc-2.24.so      [.] free
>
> The callgraph based reports (perf report -g or -G) are usually more
> helpful.

This part looks suspicious:

|--22.98%--0xc2160
|          xmlFreeDoc
|          |
|           --22.42%--xmlFreeDtd
|                     |
|                     |--19.62%--xmlHashFree
|                     |          |
|                     |          |--10.03%--_int_free
|                     |          |          |
|                     |          |           --9.56%--malloc_consolidate
|                     |          |
|                     |          |--3.69%--0x7e560
|                     |          |          xmlFreeDocElementContent
|                     |          |          |
|                     |          |           --2.19%--xmlFreeDocElementContent
|                     |          |
|                     |          |--0.71%--0x7face
|                     |          |
|                     |          |--0.66%--0x30498
|                     |          |
|                     |           --0.61%--0x7fae3
|                     |                     xmlUnlinkNode
|                     |
|                      --0.89%--xmlFreeNode

Can I tell it to not load DTDs in the first place? Is it loading the DTD
for each and every xinclude?

Stefan

>> Any ideas? Is there a known issue with using xincludes here?
>
> It might be quadratic behavior in the XInclude engine or something
> else entirely. How large is glib-docs.xml after processing XIncludes?
>
> Nick
Re: [xml] performance of parsing docbook with xincludes
Seems that the list does not like attachments, so resending with links.

On 05/14/2018 12:19 PM, Nick Wellnhofer wrote:
> On 13/05/2018 20:54, Stefan Sauer wrote:
>> Let's look at some numbers using glib
>> (https://gitlab.gnome.org/GNOME/glib)
>>
>> cd glib/docs/reference/glib
>> xmllint --timing --xinclude --noout glib-docs.xml
>> Parsing took 0 ms
>> Xinclude processing took 4560 ms
>> Freeing took 91 ms
>>
>> Any idea how I can get more breakdown of what's happening in 'Xinclude
>> processing'?
>
> It seems that "XInclude processing" also contains the time needed to
> parse the included documents, so maybe the XIncludes aren't the issue
> at all (glib-docs.xml is a small document including several larger
> ones). Can you save glib-docs.xml after processing XIncludes and check
> whether parsing the consolidated document is considerably faster?

$ xmllint --timing --xinclude glib-docs.xml >glib-full.xml
Parsing took 0 ms
Xinclude processing took 4726 ms
Saving took 85 ms
Freeing took 104 ms

$ xmllint --timing --noout glib-full.xml
Parsing took 151 ms
Freeing took 14 ms

$ ls -alh glib-full.xml
-rw-r--r-- 1 ensonic users 6.8M May 14 12:42 glib-full.xml

Parsing the consolidated doc is an order of magnitude faster. Thanks for
suggesting this test.

$ xtime sh -c 'find . -name "*.xml" | grep -v version.xml | xargs xmllint --noout'
0.40u 0.04s 0.44r 70296kB sh -c find . -name "*.xml" | grep -v version.xml | xargs xmllint --noout

Even parsing all files like this is >10 times faster.

>> Running with "perf record -g -- xmllint --timing --xinclude --noout
>> glib-docs.xml" gets me such a report.
>>
>> +  17.15%  16.69%  xmllint  libc-2.24.so      [.] _int_malloc
>> +  11.93%  11.87%  xmllint  libc-2.24.so      [.] malloc_consolidate
>> +   9.01%   8.97%  xmllint  libxml2.so.2.9.4  [.] xmlDictLookup
>> +   7.15%   0.00%  xmllint  ld-2.24.so        [.] 0x8021a0022010
>> +   6.25%   6.21%  xmllint  libxml2.so.2.9.4  [.] xmlHashAddEntry3
>> +   6.22%   0.00%  xmllint  libxml2.so.2.9.4  [.] xmlSAX2IsStandalone
>> +   6.22%   0.00%  xmllint  [unknown]         [.] 0x56413c74c0854810
>> +   3.95%   3.94%  xmllint  libxml2.so.2.9.4  [.] xmlHashLookup2
>>     3.72%   3.70%  xmllint  libc-2.24.so      [.] _int_free
>> +   3.28%   0.00%  xmllint  [unknown]         [.]
>> +   3.06%   3.04%  xmllint  libxml2.so.2.9.4  [.] xmlFreeDocElementContent
>> +   2.96%   2.91%  xmllint  libc-2.24.so      [.] free
>
> The callgraph based reports (perf report -g or -G) are usually more
> helpful.

Here is a full perf.log with callgraph:
https://gist.github.com/ensonic/a73608edd60a995374e0961d3840eb4f

>> Any ideas? Is there a known issue with using xincludes here?
>
> It might be quadratic behavior in the XInclude engine or something
> else entirely. How large is glib-docs.xml after processing XIncludes?

See above. Btw., only the toplevel doc is using xincludes; the included
docs don't include other docs:

$ grep "xi:inc" glib-docs.xml | wc -l
116

> Nick
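The comparison in this message can be made explicit by parsing the `--timing` lines and taking the ratio. The sketch below uses only the output format and the numbers quoted above; the helper name `parse_timing` is hypothetical, and the captured output is passed in as a plain string.

```python
import re

# One "<phase> took <n> ms" line per phase, as printed by xmllint --timing.
TIMING_RE = re.compile(r"^(?P<phase>.+?) took (?P<ms>\d+) ms$", re.MULTILINE)


def parse_timing(output):
    """Map each phase name in the timing output to its duration in ms."""
    return {m["phase"].strip(): int(m["ms"]) for m in TIMING_RE.finditer(output)}


# Numbers copied from the two runs quoted in this message.
with_xinclude = parse_timing(
    "Parsing took 0 ms\n"
    "Xinclude processing took 4726 ms\n"
    "Saving took 85 ms\n"
    "Freeing took 104 ms\n"
)
consolidated = parse_timing("Parsing took 151 ms\nFreeing took 14 ms\n")

ratio = with_xinclude["Xinclude processing"] / consolidated["Parsing"]
print(f"XInclude processing is {ratio:.1f}x slower than parsing the consolidated file")
```

With these figures the ratio comes out at about 31, so the XInclude path costs well over an order of magnitude more than parsing the same content from a single 6.8M file.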
Re: [xml] performance of parsing docbook with xincludes
On 13/05/2018 20:54, Stefan Sauer wrote:
> Let's look at some numbers using glib
> (https://gitlab.gnome.org/GNOME/glib)
>
> cd glib/docs/reference/glib
> xmllint --timing --xinclude --noout glib-docs.xml
> Parsing took 0 ms
> Xinclude processing took 4560 ms
> Freeing took 91 ms
>
> Any idea how I can get more breakdown of what's happening in 'Xinclude
> processing'?

It seems that "XInclude processing" also contains the time needed to
parse the included documents, so maybe the XIncludes aren't the issue at
all (glib-docs.xml is a small document including several larger ones).
Can you save glib-docs.xml after processing XIncludes and check whether
parsing the consolidated document is considerably faster?

> Running with "perf record -g -- xmllint --timing --xinclude --noout
> glib-docs.xml" gets me such a report.
>
> +  17.15%  16.69%  xmllint  libc-2.24.so      [.] _int_malloc
> +  11.93%  11.87%  xmllint  libc-2.24.so      [.] malloc_consolidate
> +   9.01%   8.97%  xmllint  libxml2.so.2.9.4  [.] xmlDictLookup
> +   7.15%   0.00%  xmllint  ld-2.24.so        [.] 0x8021a0022010
> +   6.25%   6.21%  xmllint  libxml2.so.2.9.4  [.] xmlHashAddEntry3
> +   6.22%   0.00%  xmllint  libxml2.so.2.9.4  [.] xmlSAX2IsStandalone
> +   6.22%   0.00%  xmllint  [unknown]         [.] 0x56413c74c0854810
> +   3.95%   3.94%  xmllint  libxml2.so.2.9.4  [.] xmlHashLookup2
>     3.72%   3.70%  xmllint  libc-2.24.so      [.] _int_free
> +   3.28%   0.00%  xmllint  [unknown]         [.]
> +   3.06%   3.04%  xmllint  libxml2.so.2.9.4  [.] xmlFreeDocElementContent
> +   2.96%   2.91%  xmllint  libc-2.24.so      [.] free

The callgraph based reports (perf report -g or -G) are usually more
helpful.

> Any ideas? Is there a known issue with using xincludes here?

It might be quadratic behavior in the XInclude engine or something else
entirely. How large is glib-docs.xml after processing XIncludes?

Nick
[xml] performance of parsing docbook with xincludes
Hi,

I am the maintainer of gtk-doc. The biggest complaint I get is about
performance. gtk-doc scans sources and combines the extracted comments
with handwritten DocBook into a single DocBook document. The DocBook
document uses xinclude for its parts. As a next step we were using the
docbook-stylesheets to generate reference docs as HTML (and dblatex for
PDFs).

So far I blamed the XSLT processing for the low performance, and for about
a quarter I have been working on a (Python) tool in gtk-doc that reads the
DocBook with lxml (an XML module that uses libxml2), then walks the tree a
few times and produces chunked HTML similar to the docbook stylesheets.
The tool is getting feature complete and is up to 10 times faster (despite
Python).

One reason I believed XSLT is slow is that it is single threaded, and when
I added multi-threading/processing to my Python tool I was puzzled that it
does not get much faster. At this point I added some benchmarking and
found out that the biggest chunk of time is spent on loading the XML.

Let's look at some numbers using glib
(https://gitlab.gnome.org/GNOME/glib)

cd glib/docs/reference/glib
xmllint --timing --xinclude --noout glib-docs.xml
Parsing took 0 ms
Xinclude processing took 4560 ms
Freeing took 91 ms

Any idea how I can get more breakdown of what's happening in 'Xinclude
processing'?

Running with "perf record -g -- xmllint --timing --xinclude --noout
glib-docs.xml" gets me such a report.

+  17.15%  16.69%  xmllint  libc-2.24.so      [.] _int_malloc
+  11.93%  11.87%  xmllint  libc-2.24.so      [.] malloc_consolidate
+   9.01%   8.97%  xmllint  libxml2.so.2.9.4  [.] xmlDictLookup
+   7.15%   0.00%  xmllint  ld-2.24.so        [.] 0x8021a0022010
+   6.25%   6.21%  xmllint  libxml2.so.2.9.4  [.] xmlHashAddEntry3
+   6.22%   0.00%  xmllint  libxml2.so.2.9.4  [.] xmlSAX2IsStandalone
+   6.22%   0.00%  xmllint  [unknown]         [.] 0x56413c74c0854810
+   3.95%   3.94%  xmllint  libxml2.so.2.9.4  [.] xmlHashLookup2
    3.72%   3.70%  xmllint  libc-2.24.so      [.] _int_free
+   3.28%   0.00%  xmllint  [unknown]         [.]
+   3.06%   3.04%  xmllint  libxml2.so.2.9.4  [.] xmlFreeDocElementContent
+   2.96%   2.91%  xmllint  libc-2.24.so      [.] free

Trying a different allocator seems to help quite a bit too (xtime is an
alias for /usr/bin/time -f '%Uu %Ss %er %MkB %C' "$@"):

rm html-build.stamp; ~/bin/xtime make docs
53.28u 0.99s 54.70r 202372kB make docs

rm html-build.stamp; LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4.3.0 ~/bin/xtime make docs
42.48u 1.54s 44.48r 185404kB make docs

-> saves ~11sec when using the original toolchain (libxml2 + libxslt with
docbook-stylesheets)

~/bin/xtime python3 ~/projects/gnome/gtk-doc/gtkdoc-mkhtml2 glib glib-docs.xml
7.01u 0.25s 7.27r 146068kB python3 /home/ensonic/projects/gnome/gtk-doc/gtkdoc-mkhtml2 glib glib-docs.xml

LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4.3.0 ~/bin/xtime python3 ~/projects/gnome/gtk-doc/gtkdoc-mkhtml2 glib glib-docs.xml
5.69u 0.39s 6.10r 137340kB python3 /home/ensonic/projects/gnome/gtk-doc/gtkdoc-mkhtml2 glib glib-docs.xml

-> saves ~1.5sec with my new toolchain (mostly on the loading XML side).

Any ideas? Is there a known issue with using xincludes here?

Stefan
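The allocator comparison above can be checked by parsing the xtime lines, whose layout follows the format string `'%Uu %Ss %er %MkB %C'` given in the message. The following is a small sketch; `parse_xtime` and `saving` are hypothetical helper names, and the input lines are copied from the measurements above.

```python
import re

# Layout produced by /usr/bin/time -f '%Uu %Ss %er %MkB %C'.
XTIME_RE = re.compile(
    r"^(?P<user>[\d.]+)u (?P<sys>[\d.]+)s (?P<real>[\d.]+)r "
    r"(?P<rss_kb>\d+)kB (?P<cmd>.*)$"
)


def parse_xtime(line):
    """Split one xtime line into user/sys/real seconds, max RSS in kB and command."""
    m = XTIME_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not an xtime line: {line!r}")
    return {
        "user": float(m["user"]),
        "sys": float(m["sys"]),
        "real": float(m["real"]),
        "rss_kb": int(m["rss_kb"]),
        "cmd": m["cmd"],
    }


def saving(baseline, candidate, field="real"):
    """Seconds (or kB) saved by candidate relative to baseline for one field."""
    return parse_xtime(baseline)[field] - parse_xtime(candidate)[field]


if __name__ == "__main__":
    glibc = "53.28u 0.99s 54.70r 202372kB make docs"
    tcmalloc = "42.48u 1.54s 44.48r 185404kB make docs"
    print(f"real: {saving(glibc, tcmalloc):.2f}s, "
          f"user: {saving(glibc, tcmalloc, 'user'):.2f}s")
```

For the `make docs` runs this gives about 10.2 s of wall-clock time (10.8 s of user time) saved with tcmalloc; for the gtkdoc-mkhtml2 runs the same calculation gives about 1.2 s wall-clock (1.3 s user).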