Re: [xml] performance of parsing docbook with xincludes

2018-06-07 Thread Nick Wellnhofer

On 07/06/2018 00:00, Stefan Sauer wrote:

>> Another idea is to stop loading external DTDs for XIncludes without an
>> XPointer expression. This would still change the behavior for some
>> users but it's much less likely to cause problems.
>
> change the behaviour, as in we would not catch validation errors?


No, nothing related to validation. If you validate a document, the DTDs will 
always be loaded. But parsing with or without XML_PARSE_DTDLOAD will obviously 
produce different results. It's hard to tell whether this will cause problems 
for users. But maybe I'm overly cautious. If someone parses a document without 
DTD flags, why would they assume that XIncluded documents are parsed with 
XML_PARSE_DTDLOAD?



> Too bad that xmlXIncludeParseFile() does not get the parent parserCtx;
> in that case we could apply the same flags.


I think the original flags are already passed via xmlXIncludeSetFlags.


> It seems that xmldict only handles string keys and values, right? So
> we'll even need our own cache data structure. I'd say it would need to
> live at the _xmlXIncludeCtxt level. A global is easier, but then we
> can't ever free it :/


xmlHash should work fine:

http://xmlsoft.org/html/libxml-hash.html

But building a DTD cache would be the least of your problems. The hard part is 
to apply a cached DTD to a document. There are some interactions between 
internal and external subsets (see xmlAddElementDecl and xmlAddAttributeDecl 
in valid.c for example), so it looks like you can't simply set
doc->extSubset to the cached DTD. You'd probably have to replay the calls to
xmlAddElementDecl etc., maybe even in the original order, which might be lost.
That's why I wouldn't want to go down this route.
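
To make the "cache on the context level" idea concrete, here is a minimal standalone sketch of a string-keyed cache that is owned by a context struct and can therefore be freed with it. It deliberately does not use libxml2: the names DtdCache, cache_init, cache_lookup, cache_add and cache_free are hypothetical stand-ins for what xmlHash would provide, and the payload is an opaque pointer. It only covers the bookkeeping, not the hard part of applying a cached DTD to a document.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-context cache: maps a string key (for example an
 * ExternalID or system URL) to an opaque payload. Not libxml2 API. */
typedef struct CacheEntry {
    char *key;
    void *payload;
    struct CacheEntry *next;
} CacheEntry;

typedef struct {
    CacheEntry *head;   /* owned by the enclosing context, not global */
} DtdCache;

typedef void (*PayloadFree)(void *payload);

void cache_init(DtdCache *cache) {
    assert(cache != NULL);
    cache->head = NULL;
}

/* Return the payload stored under key, or NULL if the key is absent. */
void *cache_lookup(const DtdCache *cache, const char *key) {
    for (const CacheEntry *e = cache->head; e != NULL; e = e->next) {
        if (strcmp(e->key, key) == 0)
            return e->payload;
    }
    return NULL;
}

/* Store a non-NULL payload under key. Returns 0 on success, -1 if the
 * key already exists or memory runs out. */
int cache_add(DtdCache *cache, const char *key, void *payload) {
    assert(payload != NULL);
    if (cache_lookup(cache, key) != NULL)
        return -1;
    CacheEntry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    size_t len = strlen(key) + 1;
    e->key = malloc(len);
    if (e->key == NULL) {
        free(e);
        return -1;
    }
    memcpy(e->key, key, len);
    e->payload = payload;
    e->next = cache->head;
    cache->head = e;
    return 0;
}

/* Release every entry. free_payload may be NULL when the cache does not
 * own its payloads. */
void cache_free(DtdCache *cache, PayloadFree free_payload) {
    CacheEntry *e = cache->head;
    while (e != NULL) {
        CacheEntry *next = e->next;
        if (free_payload != NULL)
            free_payload(e->payload);
        free(e->key);
        free(e);
        e = next;
    }
    cache->head = NULL;
}
```

Because the cache lives inside a context object, tearing down the context is the natural point to call cache_free, which is exactly what a global table would not offer.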


Nick
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml


Re: [xml] performance of parsing docbook with xincludes

2018-06-07 Thread Stefan Sauer
On 05/17/2018 04:18 PM, Nick Wellnhofer wrote:
> On 16/05/2018 21:51, Stefan Sauer wrote:
>> So one solution could be another flag to enable this?
>
> Yes, but it would be rather ugly.
>
>> Thanks, reading the code. Need to figure out where we could cache external
>> subsets and what a suitable key is (ExternalID ?).
>
> Note that I'm currently not planning to review and integrate larger
> patches from other developers. I only took over some libxml2
> maintenance duties because no one else did. So even if you write a
> high-quality patch, it might never get merged.
>
> Caching external subsets for XIncludes certainly sounds like a nice
> feature but I would prefer to find a simpler solution. For example,
> can't you just omit the external DTD from included documents?
I've tried this and I get some interesting differences. If I modify my
DOCTYPE declarations from e.g.:
http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd;
[
  http://www.w3.org/2003/XInclude'">
  
  %gtkdocentities;
]>

to

http://www.w3.org/2003/XInclude'">
  
  %gtkdocentities;
]>

and run (for each of the variants)
xmllint --noent --xinclude tester-docs.xml >tester-docs.nodtd.xml
then I get a lot of delta in this form:
-    http://www.w3.org/2003/XInclude;
id="api-index-0.1" xml:base="xml/api-index-0.1.xml">
+    

Basically, if there is no DTD on the doctype, the resulting xi:include
nodes won't have the xmlns:xi attribute.

What is worse, and what puzzles me, is that it causes a small difference in
the HTML output produced by xsltproc:
-FOO, macro in GtkDocTestIf
+FOO, macro in GtkDocTestIf

If I drop the DTD, the first link misses the 'class' and 'title' attributes
and the second link is not linked at all.

Stefan

> You wrote:
>
>> and gtk-doc will replicate this for the fragments (replacing 'book' with
>> e.g. 'refentry'). This way one can e.g. inject things like a version.
>
> What do you mean by "inject things like a version"? Why exactly do
> your included documents have to reference an external DTD?
>
> Another idea is to stop loading external DTDs for XIncludes without an
> XPointer expression. This would still change the behavior for some
> users but it's much less likely to cause problems.
>
> Nick





Re: [xml] performance of parsing docbook with xincludes

2018-06-07 Thread Stefan Sauer
On 06/07/2018 12:54 AM, Eric S. Eberhard wrote:
> I know I am the oddball here but -- why use DTDs at all?

I gave reasons above. I am working on a tool. How people use the tool
is not under my control. Maybe we can focus on the opportunity to
improve libxml2 a bit here.

>   I supply software to a lot of companies (thousands through
> dealers).  Many exchange millions of XML docs per day.  I've used this
> since it was libxml.  Even have some patches in there.  My application
> is proprietary (meaning XML to get an order or tell a customer our
> availability is simply XML I designed and documented and give to my
> customer's customers (via download from a Web page)).  Once they get
> it working it pretty much always works.  They write software to create
> orders and send them to us -- it is consistent (I know, not everyone
> has this luxury so this may not apply to everyone).  So why check them?
>
> I also found that I was getting a gagillion support tickets because
> DTDs ... simple things like a date ... seem to escape people -- take
> June 7, 2018
>
> In our date fields we will take:
>     Jun 7 2018
>     June 7 2018
>     the above with commas and any case (upper/lower/mixed)
>     6/7/18
>     6/7/2018
>     2018/6/7
>     20180607
>     180606
>     06-07-18
>
> And actually many many more.  Anything that is a date goes through
> this one routine and if there is any way in the world to extract a
> date, we do.
>
> Ditto money -- say $1,245.56
>
> We accept:
>     $1,245.56
>     1245.56
>     124556        (decimal is implied at 2 places if no decimal is found)
>        1,235.56
>
> And many more - same thing, one routine reads it and if we can
> possibly get a reasonable number, we do.
>
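
The lenient money handling described above could look roughly like the following sketch. It is not the poster's actual routine: the function name parse_money_cents and the exact set of tolerated characters are assumptions. It follows the stated rule that an amount without a decimal point has its last two digits treated as cents, and it ignores overflow and negative amounts.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Parse a loosely formatted amount into cents. Returns 0 on success,
 * -1 if no digits were found or an unexpected character appears.
 * Hypothetical helper for illustration only. */
int parse_money_cents(const char *text, long *cents_out) {
    assert(text != NULL && cents_out != NULL);

    long whole = 0;        /* digits before the decimal point */
    long frac = 0;         /* at most two digits after it */
    int frac_digits = 0;
    int seen_point = 0;
    int seen_digit = 0;

    for (const char *p = text; *p != '\0'; p++) {
        unsigned char c = (unsigned char)*p;
        if (isdigit(c)) {
            seen_digit = 1;
            if (!seen_point) {
                whole = whole * 10 + (c - '0');
            } else if (frac_digits < 2) {
                frac = frac * 10 + (c - '0');
                frac_digits++;
            }              /* further fractional digits are dropped */
        } else if (c == '.' && !seen_point) {
            seen_point = 1;
        } else if (c == '$' || c == ',' || isspace(c)) {
            continue;      /* tolerated decoration */
        } else {
            return -1;     /* letters, a second '.', a sign, ... */
        }
    }
    if (!seen_digit)
        return -1;

    if (!seen_point) {
        *cents_out = whole;            /* implied decimal at 2 places */
    } else {
        if (frac_digits == 1)
            frac *= 10;                /* "12.5" is read as 12.50 */
        *cents_out = whole * 100 + frac;
    }
    return 0;
}
```

With this sketch, "$1,245.56", "1245.56" and "124556" all come out as the same number of cents, and anything unrecognizable is rejected so the application can report it.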
> This, in turn, reduced our CONSTANT support tickets for silly things
> like a format of something to ZERO.  Which I like.
>
> Even sicker -- we ignore case on tags.  All of our XML is designed to
> not use duplicate names with different cases (stupid thing to do
> anyway -- expect orderNumber and OrderNumber to both be used, as
> different things).
>
> As long as the customer is consistent and the XML is well formed we
> scan the tree and compare tags without regard to case.  A WHOLE LOT
> more support tickets gone.
>
> A lot of the people we deal with are not sophisticated.  As the
> receiver of XML we decided it was much better to be as flexible as
> possible and take what we can if at all possible.  After all -- a DTD
> can indeed tell you if an address comes in without a city name.  And
> reject it and usually generate a support ticket.  Since we use an
> on-line AVS system (more XML) and if we have the zip and the address
> otherwise matches ... we don't need the city and state ... the AVS
> system provides it.  And if it fails they will get an error back from
> us (from the application) anyway.  So why use a DTD to see if the city
> or state were sent?  A LOT MORE support calls removed.
>
> And, of course, performance without the DTDs is much better.
>
> As a result we are able to give documentation to new customers and
> they are able to get it up and running with little to no help.  Any
> serious errors we cannot fix are clearly explained in the responses BY
> THE APPLICATION and not by a DTD.
>
> Being flexible on our end reduces support tickets which is all I
> care.  I would rather code for all the mistakes I can think of an
> enduser would make (and we add new ones when they crop up) than be
> strict and do a lot of support.  We don't think DTDs are flexible
> enough.  And I hate making them :-)
>
> We do offer a page with DTDs they can use manually to check their
> document if they like -- or they can send it to our test system.  Once
> they are running they seem to do just fine.
>
> As programmers it is hard to believe but sometimes it is better for us
> to make slightly less efficient code in order to make the human aspect
> much more efficient.  I once had someone send me a link to a "contest"
> which was a convoluted C statement and asking to solve what the result
> would be.  My response -- "fire the programmer!"
>
> If it takes 100s of competent C programmers to get the right answer
> (and only a small percent did) to read a line of code -- it is bad
> code.  And for people's information, modern computers read ahead and
> pre-execute code based on all kinds of weird logic.  Simple C code is
> easy for it to handle ... but convoluted code ends up stopping the
> pre-execution and is actually slower -- may have less lines of code --
> but it will be slower.  I see nothing wrong with short clear clean
> code with as little craziness as possible.  This is the same with XML
> -- one c