Re: [xml] Schema validation skipping IDC

2022-02-09 Thread Stefan de Konink

On Wednesday, February 9, 2022 1:25:41 PM CET, Nick Wellnhofer wrote:
> I'm always reluctant to add new features, especially if it
> sounds like it only solves a problem for a single user. Do you
> want to disable checking of identity constraints for performance
> reasons or is there another use case?


Yes, the motivation is performance: syntax validation is extremely fast 
and powerful (even single-threaded, as expected), but IDC validation is 
costly for documents of our size. For most of the parties validating and 
implementing our standard, IDC validation also comes 'too soon'. 
Especially when syntax validation fails, it does not make sense to 
continue with the second stage.


As Eric pointed out, supporting this use case currently requires two 
schemas: one with identity constraints and one without. Since our schema 
consists of 384 individual XSD files, a search-and-replace on the fly is 
not trivial.
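
To make the "one with and one without" workaround concrete, here is a 
minimal sketch in Python of stripping the identity constraints from one 
parsed XSD. It uses only the standard library; the function name is 
hypothetical and not part of libxml2 or lxml. Two assumptions to note: 
every one of the included XSD files would need the same treatment, and 
the standard library's ElementTree does not keep namespace prefix 
declarations that are only used inside attribute values (such as 
type="netex:..."), so writing the result back out would need a 
prefix-preserving library.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
# The three identity-constraint elements defined by XML Schema.
IDC_TAGS = {f"{{{XS}}}{name}" for name in ("key", "keyref", "unique")}


def strip_identity_constraints(root):
    """Remove every xs:key, xs:keyref and xs:unique below root (hypothetical helper).

    Mutates the tree in place and returns the number of removed elements.
    """
    # Collect (parent, child) pairs first so the tree is not mutated while iterating.
    doomed = [
        (parent, child)
        for parent in root.iter()
        for child in list(parent)
        if child.tag in IDC_TAGS
    ]
    for parent, child in doomed:
        parent.remove(child)
    return len(doomed)
```

The returned count makes it easy to see which of the files actually 
carry constraints.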


From a software perspective I would love to have the option of a single 
schema in which just the IDC functionality can be disabled, perhaps even 
"per type". Even if the IDC performance were fixed, I don't see it being 
fixed to the point where it could be considered a no-op.


--
Stefan
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml


[xml] Schema validation skipping IDC

2022-02-01 Thread Stefan de Konink

Hi,

Would a patch be accepted that would create an option to disable identity 
constraints at runtime? Use case: only syntactically validate a file.


--
Stefan


Re: [xml] Resuming maintenance

2022-01-12 Thread Stefan de Konink

Hi Nick,

On Wednesday, January 12, 2022 3:49:07 PM CET, Nick Wellnhofer wrote:
> I didn't make any performance improvements to the XSD code
> personally. You're probably seeing improvements from the
> following commit which wasn't authored by me:
>
> https://gitlab.gnome.org/GNOME/libxml2/-/commit/faea2fa9


Exactly.


> If you're seeing degraded performance on large documents, it's
> likely another issue with quadratic runtime. Fixing such issues
> algorithmically should typically yield much better results than
> trying to work around them with multi-threading.


What can I do to identify these things in a usable way? Would a profiler 
help in this case?


--
Stefan


Re: [xml] Resuming maintenance

2022-01-10 Thread Stefan de Konink

Dear Nick,

This is great news; thanks to Google for acknowledging the importance of 
maintaining core open source products. Your previous improvements to XSD 
validation made a great difference, but from my prototype in Python (lxml) 
I assume that multithreaded constraint validation and a more efficient way 
of storage would gain additional performance on files larger than 500MB. 
One may ask whether a 'green fund' could donate money for these types of 
improvements.


--
Stefan


Re: [xml] Constraint validation for huge documents

2021-01-06 Thread Stefan de Konink

Hi Liam,

On Wednesday, January 6, 2021 2:35:53 AM CET, Liam R E Quin wrote:
> Could you do this instead using schematron?


Would you have an example of how to express a key identity constraint in 
Schematron? I am happy to benchmark it.


--
Stefan


Re: [xml] Constraint validation for huge documents

2021-01-05 Thread Stefan de Konink

Hi Nick,


Thanks for your reply. It does have a noticeable impact; although I 
compiled libxml2-git yesterday, I had overlooked it.



With the single constraint file:
libxml2-2.9.10
User time (seconds): 90.81
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:31.60

libxml2-git
User time (seconds): 49.57
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:50.57

With the full constraint file:
libxml2-2.9.10
Not completed after 1 hour 30 min

libxml2-git
User time (seconds): 900.60
Elapsed (wall clock) time (h:mm:ss or m:ss): 15:02.87


Yesterday I wrote a custom validator in lxml for key/keyref and unique 
constraints. It first validates syntactically using the normal libxml2 
code, then fetches all constraints (this might be a shortcut) and creates 
a hash set per constraint. This step can be executed in parallel, one task 
per constraint. By taking the number of elements per constraint into 
account (via heuristics, if the same XSD is used over time), the work can 
be balanced so that parallelism is sustained over a longer period.
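
As an illustration of the approach described above (not the actual 
validator), here is a minimal sketch with one hash set per constraint 
and one task per constraint. It uses the Python standard library instead 
of lxml; the constraint table format and the function names are 
assumptions made for this example. With the standard library's 
ElementTree on a CPython build with a GIL the threads largely take 
turns, so this shows the structure rather than the speedup.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def check_unique(root, tag, fields):
    """Return the compound keys that occur more than once for one constraint."""
    seen, duplicates = set(), set()
    for elem in root.iter(tag):
        key = tuple(elem.get(field) for field in fields)
        if None in key:
            # xs:unique only considers elements for which every field is present.
            continue
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return duplicates


def check_all(root, constraints, workers=8):
    """Run every constraint as its own task; the tree is only read, never modified.

    constraints maps a constraint name to (element tag, attribute names);
    this table layout is hypothetical.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            name: pool.submit(check_unique, root, tag, fields)
            for name, (tag, fields) in constraints.items()
        }
        return {name: future.result() for name, future in futures.items()}
```

Calling check_all(root, {"some_key": ("SomeElement", ("id", "version"))}) 
returns a dict mapping each constraint name to its set of duplicated 
keys; an empty set means the constraint holds.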


With multithreading (8):
User time (seconds): 1136.37
Percent of CPU this job got: 388%
Elapsed (wall clock) time (h:mm:ss or m:ss): 4:57.09

Without multithreading:
User time (seconds): 709.82
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 11:52.15



I assume that the optimisation currently present in git is a serious 
improvement. Sure, it is still not 'perfect' but I think that doing the 
validation in parallel might be something worthwhile to explore.
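
For reference, the ratios implied by the timings above can be worked out 
with a few lines of Python. The helper name is made up for this example; 
the numbers are the ones reported in this message. The result is a 
wall-clock speedup of roughly 2.4x from 8 threads, bought with about 
1.6x the user time, and libxml2-git being roughly 1.8x faster than 
2.9.10 on the single constraint file.

```python
def seconds(clock):
    """Convert an 'h:mm:ss' or 'm:ss.ff' wall-clock string to seconds."""
    total = 0.0
    for part in clock.split(":"):
        total = total * 60 + float(part)
    return total


# Custom validator: elapsed time without and with multithreading (8).
speedup = seconds("11:52.15") / seconds("4:57.09")

# Extra CPU (user time) spent by the multithreaded run.
cpu_overhead = 1136.37 / 709.82

# Single constraint file: libxml2-2.9.10 versus libxml2-git, elapsed time.
git_vs_2910 = seconds("1:31.60") / seconds("0:50.57")

print(f"speedup {speedup:.2f}x, cpu overhead {cpu_overhead:.2f}x, "
      f"git vs 2.9.10 {git_vs_2910:.2f}x")
```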


--
Stefan


[xml] Constraint validation for huge documents

2021-01-04 Thread Stefan de Konink

Hello,


I am working on a project that aims to validate open data against an open 
standard defined in an XML Schema[1]. Document sizes range from 13kB to 
2GB[2]. The basic problem I face is key constraint validation, defined as 
key, keyref and unique combinations. The special case here is that most of 
our validation involves compound keys: objects have an ID and a version, 
and a reference should match a foreign object with that same pair. To 
illustrate:

[XSD fragment; the element markup was lost in the archive. Surviving text:]

   Every [StopPointInJourneyPattern Id + Version + order] must be unique
   within document.

   refer="netex:StopPointInJourneyPattern_AnyVersionedKey_ordered"

Because XML Schema validation performance is generally terrible, the 
project has one XSD root with constraint validation and a separate file 
without constraint validation.


The syntax validation performance of libxml2 alone is, in my view, quite 
good. It takes about 14s to load the entire XSD, 9s to load a file of 
about 400MB, and 3s to validate it. Xerces-C would take 50s in total.


The main problem that I am trying to address is constraint validation 
itself, which takes unreasonably long. I think improving this would help 
the general public, not only this project. Adding only the constraint 
illustrated above increases the validation time from 3 seconds to 186 
seconds.


If we peek into the document using xmllint --shell:
setns netex=http://www.netex.org.uk/netex
xpath count(.//netex:StopPointInJourneyPattern)
Object is a number : 39509

Within 2 seconds the following is evaluated:
xpath count(.//netex:StopPointInJourneyPatternRef | 
.//netex:FarePointInPatternRef | .//netex:FromPointInPatternRef | 
.//netex:ToPointInPatternRef  | .//netex:StartPointInPatternRef  | 
.//netex:EndPointInPatternRef)

Object is a number : 0
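
The same counts can be reproduced outside the xmllint shell. A minimal 
sketch using Python's standard library follows; the function name is 
hypothetical, and only the namespace URI and element names come from the 
session above.

```python
import xml.etree.ElementTree as ET

NETEX = "http://www.netex.org.uk/netex"


def count_elements(root, local_names):
    """Count elements in the NeTEx namespace whose local name is in local_names."""
    wanted = {f"{{{NETEX}}}{name}" for name in local_names}
    # root.iter() visits the root element and all of its descendants once,
    # so any number of element names is counted in a single pass.
    return sum(1 for elem in root.iter() if elem.tag in wanted)
```

For example, count_elements(root, ["StopPointInJourneyPattern"]) 
corresponds to the first query, and passing the six reference element 
names corresponds to the second.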


I would like to ask some naive questions concerning the schema validation.

1) Considering the document contains no reference elements to match 
against the key, why would the keyref be evaluated at all? Removing the 
key/keyref pair manually reduces the validation time to 77s, which is 
still quite high for merely evaluating uniqueness. For the unique 
constraint this shortcut seems to be in effect: a selector that matches no 
elements does not cause overhead.


2) Limiting the uniqueness constraint to merely @id, the validation time is 
reduced to 37s.


3) Considering my count() performance above (a second or two), querying the 
document does not really seem to be the issue. Sure, it queries the entire 
tree for a single object, but one could argue that the xpath result would 
be a one-time effort, or an index could be placed on all elements to be 
queried. For example, each xsd:key would be a hash list, and all keyrefs 
could be looked up in that hash list.


4) Changing the xpath expression to the one below increases the evaluation 
time to 1 minute and 20 seconds. A valid expression without any result 
reduces the computation time to 3 seconds. I find it interesting that a 
full path xpath expression (including root) seems to work faster in the 
xmllint shell, but performs worse as a selector.


netex:dataObjects/netex:CompositeFrame/netex:frames/netex:ServiceFrame/netex:journeyPatterns/netex:ServiceJourneyPattern/netex:pointsInSequence/netex:StopPointInJourneyPattern

5) Considering the constraint validation is read-only, would it be possible 
to parallelize the constraints using multithreading?
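
To make the idea in question 3 concrete, here is a minimal sketch of a 
hash index per key with keyref lookups against it. It is an illustration 
only, not how libxml2 implements identity constraints; the function 
names and the attribute names used in the usage note are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_key_index(root, tag, fields):
    """Build the hash set of compound keys in a single pass over the document."""
    return {tuple(elem.get(field) for field in fields) for elem in root.iter(tag)}


def dangling_refs(root, ref_tags, ref_fields, key_index):
    """Return (tag, key) for every reference whose key is absent from the index."""
    missing = []
    for tag in ref_tags:
        # When the document contains no such reference elements this loop
        # body never runs, so an unused keyref costs one scan and no lookups.
        for ref in root.iter(tag):
            key = tuple(ref.get(field) for field in ref_fields)
            if key not in key_index:
                missing.append((tag, key))
    return missing
```

With index = build_key_index(root, "Stop", ("id", "version")), a call 
such as dangling_refs(root, ["StopRef"], ("ref", "version"), index) does 
one set lookup per reference, so the cost grows with the number of keys 
plus the number of references rather than with their product.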



The top of an oprofile trace for the entire constraint checking document 
looks like this:


CPU: AMD64 generic, speed 2000 MHz (estimated)
Counted CPU_CLK_UNHALTED events (CPU Clocks not Halted) with a unit mask of 
0x00 (No unit mask) count 10

samples   %        image name         symbol name
16300585  52.9514  libxml2.so.2.9.10  xmlStreamPushInternal
5715003   18.5648  libxml2.so.2.9.10  xmlStreamPop
1547103    5.0257  libxml2.so.2.9.10  xmlSchemaXPathEvaluate
1341334    4.3572  libxml2.so.2.9.10  xmlSchemaXPathProcessHistory
776914     2.5238  libxml2.so.2.9.10  xmlStrchr
636948     2.0691  libxml2.so.2.9.10  xmlSchemaValidatorPopElem
585883     1.9032  libxml2.so.2.9.10  xmlStrEqual
559714     1.8182  libxml2.so.2.9.10  xmlStreamPushAttr
369550     1.2005  libxml2.so.2.9.10  xmlHashLookup3
295317     0.9593  libxml2.so.2.9.10  __xmlRaiseError
295251     0.9591  libxml2.so.2.9.10  xmlSchemaXPathPop
260483     0.8462  libxml2.so.2.9.10  xmlStreamPush
124948     0.4059  libxml2.so.2.9.10  xmlStrlen
114542     0.3721  libxml2.so.2.9.10  xmlFACompareAtoms
98775      0.3209  libxml2.so.2.9.10  xmlFAComputesDeterminism
98228      0.3191  libxml2.so.2.9.10  xmlSchemaVAttributesComplex
90614      0.2944  libxml2.so.2.9.10  xmlRegStrEqualWildcard
81907      0.2661  libc-2.32.so       malloc_consolidate
81235      0.2639  libxml2.so.2.9.10  xmlFARecurseDeterminism
62143      0.2019  libxml2.so.2.9.10  xmlStrdup
60806