Re: [xml] performance of parsing docbook with xincludes

Stefan Sauer Thu, 07 Jun 2018 00:23:15 -0700

On 06/07/2018 12:54 AM, Eric S. Eberhard wrote:
> I know I am the oddball here but -- why use DTDs at all?


I gave reasons above. I am working on a tool. How people use the tool
is not under my control. Maybe we can focus on the opportunity to
improve libxml2 a bit here.

>   I supply software to a lot of companies (thousands through
> dealers).  Many exchange millions of XML docs per day.  I've used this
> since it was libxml.  Even have some patches in there.  My application
> is proprietary (meaning XML to get an order or tell a customer our
> availability is simply XML I designed and documented and give to my
> customer's customers (via download from a Web page)).  Once they get
> it working it pretty much always works.  They write software to create
> orders and send them to us -- it is consistent (I know, not everyone
> has this luxury so this may not apply to everyone).  So why check them?
>
> I also found that I was getting a gagillion support tickets because
> DTDs ... simple things like a date ... seem to escape people -- take
> June 7, 2018
>
> In our date fields we will take:
>     Jun 7 2018
>     June 7 2018
>     the above with commas and any case (upper/lower/mixed)
>     6/7/18
>     6/7/2018
>     2018/6/7
>     20180607
>     180607
>     06-07-18
>
> And actually many many more.  Anything that is a date goes through
> this one routine and if there is any way in the world to extract a
> date, we do.
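
The "one routine" approach described above can be sketched in a few lines of Python. This is only an illustration, not Eric's code: parse_date and the format list are hypothetical, and reading 6/7/18 as month/day (US order) is an assumption.

```python
from datetime import date, datetime
from typing import Optional

# Candidate layouts, tried in order. The list and its order are
# assumptions for illustration; month/day order is assumed to be US style.
_FORMATS = [
    "%b %d %Y",   # Jun 7 2018
    "%B %d %Y",   # June 7 2018
    "%m/%d/%Y",   # 6/7/2018
    "%m/%d/%y",   # 6/7/18
    "%Y/%m/%d",   # 2018/6/7
    "%Y%m%d",     # 20180607
    "%y%m%d",     # 180607
]


def parse_date(text: str) -> Optional[date]:
    """Return a date if any known layout fits, otherwise None."""
    # Drop commas, collapse whitespace, and treat '-' like '/'.
    cleaned = " ".join(text.replace(",", " ").split()).replace("-", "/")
    for fmt in _FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # layout did not fit, try the next one
    return None
```

Month names are matched without regard to case by strptime, so "JUNE 7, 2018" and "june 7 2018" both work. Input that fits no layout yields None, which leaves it to the application to report a clear error.
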
>
> Ditto money -- say $1,245.56
>
> We accept:
>     $1,245.56
>     1245.56
>     124556        (decimal is implied at 2 places if no decimal is found)
>     1,245.56
>
> And many more - same thing, one routine reads it and if we can
> possibly get a reasonable number, we do.
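
A similar sketch for the money routine, again in Python and again only an illustration: parse_money is a hypothetical helper, and the rule "no decimal point means the last two digits are cents" is taken from the list above.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_money(text: str) -> Optional[Decimal]:
    """Return the amount with two decimal places, or None if unreadable."""
    # Strip the currency sign and thousands separators.
    cleaned = text.strip().replace("$", "").replace(",", "").strip()
    try:
        amount = Decimal(cleaned)
        if not amount.is_finite():
            return None  # reject NaN and Infinity
        if "." not in cleaned:
            # No decimal point found: two decimal places are implied.
            amount = amount / 100
        return amount.quantize(Decimal("0.01"))
    except InvalidOperation:
        return None
```

Using Decimal rather than float keeps the cents exact, which matters once the amounts are added up.
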
>
> This, in turn, reduced our CONSTANT support tickets for silly things
> like a format of something to ZERO.  Which I like.
>
> Even sicker -- we ignore case on tags.  All of our XML is designed to
> not use duplicate names with different cases (a stupid thing to do
> anyway -- expecting orderNumber and OrderNumber to both be used, as
> different things).
>
> As long as the customer is consistent and the XML is well formed we
> scan the tree and compare tags without regard to case.  A WHOLE LOT
> more support tickets gone.
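
Scanning the tree and comparing tags without regard to case could look roughly like this. It is a sketch using Python's standard xml.etree.ElementTree rather than libxml2, and find_ignoring_case is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def find_ignoring_case(root: ET.Element, name: str) -> Optional[ET.Element]:
    """Return the first element whose tag equals name, ignoring case."""
    wanted = name.lower()
    for element in root.iter():  # walks root and all descendants
        tag = element.tag
        if isinstance(tag, str) and tag.lower() == wanted:
            return element
    return None
```

The document still has to be well formed, so an opening <OrderNumber> must be closed by </OrderNumber> with the same case; only the lookup is relaxed.
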
>
> A lot of the people we deal with are not sophisticated.  As the
> receiver of XML we decided it was much better to be as flexible as
> possible and take what we can if at all possible.  After all -- a DTD
> can indeed tell you if an address comes in without a city name.  And
> reject it and usually generate a support ticket.  Since we use an
> on-line AVS system (more XML) and if we have the zip and the address
> otherwise matches ... we don't need the city and state ... the AVS
> system provides it.  And if it fails they will get an error back from
> us (from the application) anyway.  So why use a DTD to see if the city
> or state were sent?  A LOT MORE support calls removed.
>
> And, of course, performance without the DTDs is much better.
>
> As a result we are able to give documentation to new customers and
> they are able to get it up and running with little to no help.  Any
> serious errors we cannot fix are clearly explained in the responses BY
> THE APPLICATION and not by a DTD.
>
> Being flexible on our end reduces support tickets, which is all I
> care about.  I would rather code for all the mistakes I can think of
> that an end user would make (and we add new ones when they crop up)
> than be strict and do a lot of support.  We don't think DTDs are
> flexible enough.  And I hate making them :-)
>
> We do offer a page with DTDs they can use manually to check their
> document if they like -- or they can send it to our test system.  Once
> they are running they seem to do just fine.
>
> As programmers it is hard to believe but sometimes it is better for us
> to make slightly less efficient code in order to make the human aspect
> much more efficient.  I once had someone send me a link to a "contest"
> which was a convoluted C statement, asking what the result
> would be.  My response -- "fire the programmer!"
>
> If it takes 100s of competent C programmers to get the right answer
> (and only a small percent did) to read a line of code -- it is bad
> code.  And for people's information, modern computers read ahead and
> pre-execute code based on all kinds of weird logic.  Simple C code is
> easy for it to handle ... but convoluted code ends up stopping the
> pre-execution and is actually slower -- it may have fewer lines of code --
> but it will be slower.  I see nothing wrong with short clear clean
> code with as little craziness as possible.  This is the same with XML
> -- one can go overboard easily, K.I.S.S.  :-)
>
> Not being so strict and no DTDs has had other benefits -- say EDI
> (from old IBMs) -- we have a cheap program that maps EDI to XML and
> back.  So we can handle EDI -- and we don't need new software (after
> the conversion).  We accept the EDI, convert to XML, run our standard
> application, create XML response, which is converted to EDI.  The
> package we use is low cost and no, it won't work too well with DTDs as
> EDI has its own problems.
>
> I could go on but most of you have probably skipped this post by now :-)
>
> E
>
> On 6/6/2018 3:00 PM, Stefan Sauer wrote:
>> On 05/17/2018 06:01 PM, Stefan Sauer wrote:
>>   
>>> On 05/17/2018 04:18 PM, Nick Wellnhofer wrote:
>>>     
>>>> On 16/05/2018 21:51, Stefan Sauer wrote:
>>>>       
>>>>> So one solution could be another flag to enable this?
>>>>>         
>>>> Yes, but it would be rather ugly.
>>>>       
>>> In which sense? I guess because it is something that no one should need
>>> to know about or have to care about?
>>>     
>>>>> Thanks, reading the code. Need to figure where we could cache external
>>>>> subsets and what a suitable key is (ExternalID ?).
>>>>>         
>>>> Note that I'm currently not planning to review and integrate larger
>>>> patches from other developers. I only took over some libxml2
>>>> maintenance duties because no one else did. So even if you write a
>>>> high-quality patch, it might never get merged.
>>>>       
>>> Thanks for making this clear upfront. This is how I ended up becoming
>>> the gtkdoc maintainer :)
>>>
>>>     
>>>> Caching external subsets for XIncludes certainly sounds like a nice
>>>> feature but I would prefer to find a simpler solution. For example,
>>>> can't you just omit the external DTD from included documents?
>>>>       
>>> Yeah, right now, the benefit of having the DTD is that one can validate
>>> fragments. I'll do some research (aka grepping over existing projects)
>>> to see what the doctype headers in use today look like. If all that
>>> people do is use an entity to inject the version, I'll write a
>>> migration tool.
>>>
>>> We have a test that validates the doc, but I think I can change this to
>>> just resolve all xincludes and check through the top-level doctype.
>>>     
>> Just to add to this, I am assuming a lot of people follow this book
>> http://www.sagehill.net/docbookxsl/ModularDoc.html#UsingXinclude
>>
>> and using a DOCTYPE is part of the examples.
>>   
>>>> You wrote:
>>>>
>>>>       
>>>>> and gtk-doc will replicate this for the fragments (replacing 'book' with
>>>>> e.g. 'refentry'). This way one can e.g. inject things like a version.
>>>>>         
>>>> What do you mean by "inject things like a version"? Why exactly do
>>>> your included documents have to reference an external DTD?
>>>>       
>>> The documentation consists of a handwritten master doc (type book) that
>>> includes more handwritten parts (e.g. tutorials, guides) and includes
>>> generated reference docs. When gtkdoc generates the reference docs, it
>>> takes the doctype header of the master doc as a template and
>>> uses that for the generated reference docs. If the master doc has
>>> entities declared, those can be expanded in the reference fragments.
>>> That's the part where I will check how widely it is actually used.
>>>
>>> Stefan
>>>
>>>     
>>>> Another idea is to stop loading external DTDs for XIncludes without an
>>>> XPointer expression. This would still change the behavior for some
>>>> users but it's much less likely to cause problems.
>>>>       
>> Change the behaviour, as in we would not catch validation errors?
>> Too bad that xmlXIncludeParseFile() does not get the parent parserCtx;
>> in that case we could apply the same flags.
>>   
>>>> Nick
>>>>       
>>> I definitely don't know enough about the implications here. I was mostly
>>> thinking to see if we can stick a dictionary of <dtd-identifier,
>>> xmlDtdPtr> into the Parser Context and before actually loading a dtd,
>>> check if we did already and reuse. Somehow the dict needs to be stored
>>> in the top-level doc, when parsing is done (do we need the dtds once the
>>> doc has been parsed?). We only free the dtds with the top-level doc. But
>>> I agree, it is not going to be a two liner.
>>>     
>> It seems that xmlDict only handles keys and values that are strings,
>> right? So we'll even need our own cache data structure. I'd say it
>> would need to be on the _xmlXIncludeCtxt level. Global is easier, but
>> then we can't ever free it :/
>>
>> Stefan
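
The caching idea discussed above (look up an already loaded DTD by its identifier before loading it again, and free everything together with the owning context) can be sketched as a small data structure. This is Python, not libxml2 code, and everything in it is hypothetical: DtdCache, the loader callback, and the choice of (external ID, system ID) as the key are assumptions for illustration only.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical key: (external/public ID, system ID) of the DOCTYPE.
DtdKey = Tuple[str, str]


class DtdCache:
    """Per-context cache so each external subset is loaded only once."""

    def __init__(self, loader: Callable[[str, str], Any]) -> None:
        # loader stands in for whatever actually parses an external DTD.
        self._loader = loader
        self._dtds: Dict[DtdKey, Any] = {}

    def get(self, external_id: str, system_id: str) -> Any:
        key = (external_id, system_id)
        if key not in self._dtds:
            # First request for this DTD: load it and remember the result.
            self._dtds[key] = self._loader(external_id, system_id)
        return self._dtds[key]

    def clear(self) -> None:
        # Called when the owning context goes away, so nothing is
        # kept alive globally.
        self._dtds.clear()
```

Tying the cache to a context object rather than a global mirrors the concern raised above: a global cache is simpler but has no natural point at which it can be freed.
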
>>   
>>> Stefan
>>>
>>>
>>> _______________________________________________
>>> xml mailing list, project page  http://xmlsoft.org/
>>> [email protected]
>>> https://mail.gnome.org/mailman/listinfo/xml
>>>     
>
> -- 
> Eric S. Eberhard
> VICS
> 2933 W Middle Verde Road
> Camp Verde, AZ  86322
>
> 928-567-3727  work                      928-301-7537  cell
>
> http://www.vicsmba.com/index.html             (our work)
> http://www.vicsmba.com/ourpics/index.html     (fun pictures)

