Re: [basex-talk] BaseX and validating the entire database

2019-12-17 Thread ERRINGTON Luke
Hi Daniel,

The XQuery does look shorter in the example provided, but I think that once the 
Schematron definition is expanded to include more rules and requirements, it 
would soon become the terser of the two. Also, the XQuery may contain calls 
such as db:open() that require some knowledge of the database name, whereas 
the Schematron definition should be independent of that and thus more 
transferable between databases, and even to toolsets outside of a database.

However, you bring up a good point about speed, which I think Christian has 
expanded upon.

I’ve just done some testing with the Schematron project referenced in the BaseX 
documentation [1], https://github.com/Schematron/schematron-basex. This also 
appears to be implemented solely in terms of XSLT, so I’m not sure whether it 
or SchXslt is better. However, I am having problems getting it working: it 
involves several transformations, and they produce results that don’t parse as 
valid XML (and thus can’t be used as input to the next transformation). I 
might try SchXslt.

Thanks,
Luke

[1] http://docs.basex.org/wiki/Validation_Module

From: Zimmel, Daniel 
Sent: Friday, 13 December 2019 8:14 PM
To: 'Hans-Juergen Rennau'; Christian Grün; ERRINGTON Luke
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] BaseX and validating the entire database


I would second that: using Schematron here seems more complicated than actually 
writing the code in XQuery, and the XQuery is even shorter.

We do this kind of check in XQuery all the time, similar to the examples below.

Schema validation can also be quite slow compared to optimized queries in 
XQuery/XPath.



Having said that, Schematron validation does work seamlessly with BaseX, but as 
far as I know it is not possible to pass external parameters to a Schematron 
file.

So you would have to write your Schematron code in an XQuery variable anyway, or 
try to insert the values dynamically (which is possible but does not sound very 
robust).

For document (not consistency) checks we use the SchXslt implementation 
(https://github.com/schxslt/schxslt), which does an excellent job, because the 
module implementation linked on the BaseX wiki is still XSLT 1.0-only (and 1.0 
support was temporarily dropped in Saxon 9.8). SchXslt also ships a 
ready-to-use BaseX module.


Daniel

From: Hans-Juergen Rennau <hren...@yahoo.de>
Sent: Friday, 13 December 2019 10:11
To: Christian Grün <christian.gr...@gmail.com>; ERRINGTON Luke 
<luke.erring...@sydac.com>
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] BaseX and validating the entire database

Hi Luke, I would like to emphasize (or simply remind you of) two key features 
of XPath (and XML technology in general). The FIRST one is that the 
information in a single document, in a collection of documents, or in a 
collection of document fragments is treated identically. So, for example, 
$data//foo works regardless of whether $data is one document, a collection of 
documents, a single element extracted from some document, a collection of 
elements extracted from multiple documents, or even a mixture of documents 
exposed by a database, the file system, REST service responses etc. Therefore 
collecting documents into a single document prior to processing is (in my 
opinion) somewhat against the grain of what XML technology excels at.
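
As a minimal sketch of this first point, the same path expression can be handed 
a single document or a mixed sequence of documents and elements. The inline 
documents and the local:foos name are made up for illustration and stand in for 
real data:

```xquery
(: Made-up sample data: two stand-alone documents and one loose element. :)
declare variable $doc1 := document { <root><foo>a</foo></root> };
declare variable $doc2 := document { <root><bar><foo>b</foo></bar></root> };
declare variable $fragment := <part><foo>c</foo></part>;

(: The same path expression, whatever $data happens to contain. :)
declare function local:foos($data as node()*) as element(foo)* {
  $data//foo
};

(
  count(local:foos($doc1)),                       (: one document :)
  count(local:foos(($doc1, $doc2, $fragment)))    (: documents plus an element :)
)
```

The query returns the counts 1 and 3. Against a BaseX database, the hard-coded 
sequence could presumably be replaced by the result of db:open() for the 
database in question.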

The SECOND point is that XPath has been specified with mathematical precision, 
so I cannot imagine being more precise and concise when it comes to defining 
*rules*. (That XPath expressions cannot easily replace a grammar is a different 
matter, of course.)

And finally, I would not overemphasize the importance of using Schematron, as 
equivalent validation functionality is fairly easy to implement using just 
XQuery/XPath. It is the XPath language that is the engine and heartbeat of it 
all; whether one uses the Schematron framework is a secondary question, 
ingenious and handy though it is for typical single-document checks.
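
One way such validation might look in plain XQuery 3.1 is sketched below. This 
is only an illustration, not an existing library: the rule structure (a map 
with 'message', 'context' and 'assert' entries), the local:validate function, 
the uniqueness rule and the sample data are all assumptions. It loosely mirrors 
the Schematron pattern of a rule context plus an assertion:

```xquery
(: Made-up sample data: the id "o1" is defined in two different documents. :)
declare variable $docs := (
  document { <objects><object id="o1"/><object id="o2"/></objects> },
  document { <objects><object id="o1"/></objects> }
);

(: Each rule is a map: a message, a function that selects the context nodes,
   and a function that must return true for every context node. :)
declare variable $rules := (
  map {
    'message': 'Object id is not unique',
    'context': function($data as node()*) as node()* { $data//object },
    'assert':  function($node as node(), $data as node()*) as xs:boolean {
      count($data//object[@id = $node/@id]) = 1
    }
  }
);

(: Generic runner: one <failed/> element per context node that breaks a rule. :)
declare function local:validate(
  $data  as node()*,
  $rules as map(*)*
) as element(failed)* {
  for $rule in $rules
  for $node in $rule?context($data)
  where not($rule?assert($node, $data))
  return <failed id="{ $node/@id }">{ $rule?message }</failed>
};

local:validate($docs, $rules)
```

With this sample data the runner reports both definitions of "o1". The rules 
are kept apart from the code that applies them, so further rules can be added 
to $rules without touching local:validate.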

Cheers,
Hans

On Friday, 13 December 2019, 07:53:48 CET, ERRINGTON Luke 
<luke.erring...@sydac.com> wrote:


Hi Christian,

Thank you for your time in preparing your response and examples. You describe 
the approach that I thought would be necessary if we couldn't get some sort of 
schema validation to work. Unfortunately the specification of the validation 
requirements in XQuery code is not as clean, clear or minimal as might be 
desired.

It would be nice to have some sort of pre-commit hook for validating 
modifications to the database so that we are not restricted to only allowing 
modifications through XQuery. It looks as though this is the point of 
https://github.com/BaseXdb/basex/issues/1082, but it looks as though that is 
on 

Re: [basex-talk] BaseX and validating the entire database

2019-12-17 Thread ERRINGTON Luke
Thanks Hans,

I understand your points, which are in part what prompted my question – since 
XPath can be applied to a collection of documents, and Schematron expresses 
rules in terms of XPath, then why can’t those rules be applied to a collection 
of documents? The answer appears to be that the Schematron implementation 
uses XSLT, and it appears to me that an XSLT transformation only applies to a 
single document.

As much as an approach using XQuery may be the favoured option on this mailing 
list, I can guarantee that when I present a solution using BaseX to the rest of 
the company and tackle the issue of referential integrity, I would receive a 
more favourable response if I could present a ‘schema’, or a set of validation 
rules, in its simplest form, without the rules appearing to be embedded in 
code. (This would not only be more minimal, but also more aligned with how 
foreign keys are defined in an RDBMS: as statements/declarations.)

Thanks again,
Luke

On Friday, 13 December 2019, 07:53:48 CET, ERRINGTON Luke 
<luke.erring...@sydac.com> wrote:


Hi Christian,

Thank you for your time in preparing your response and examples. You describe 
the approach that I thought would be necessary if we couldn't get some sort of 
schema validation to work. Unfortunately the specification of the validation 
requirements in XQuery code is not as clean, clear or minimal as might be 
desired.

It would be nice to have some sort of pre-commit hook for validating 
modifications to the database so that we are not restricted to only allowing 
modifications through XQuery. It looks as though this is the point of 
https://github.com/BaseXdb/basex/issues/1082, but it looks as though that is 
on hold, after some significant discussion.

Presumably I could achieve schema validation by having the entire data set 
inside one document, but that would lose the benefits of collections, and 
having the data arranged similar to a file system, so ... I was hoping that I 
could define a Schematron rule something like this (untested, because I'm 
struggling to get Schematron working at the moment; the error is "Content is 
not allowed in prolog"):




Trying to map invalid object id
Trying to map invalid object id




This is relatively minimal and expressive. It seems to work just by XPath, so 
all I need is //object/@id to find the object IDs present in all documents, not 
just this one. But, when I use //object/@id as a path in BaseX it does just 
that! It returns all of the object IDs, in all of the documents - so maybe this 
schema can be used across all documents at once! That would be fantastic!
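
A rough sketch of that cross-document idea in plain XQuery rather than 
Schematron follows. Only //object/@id is taken from the text above; the 
mapping element, its object-id attribute, the local:dangling function and the 
sample data are hypothetical:

```xquery
(: Made-up sample data: objects live in two documents, mappings in a third. :)
declare variable $docs := (
  document { <objects><object id="o1"/><object id="o2"/></objects> },
  document { <objects><object id="o3"/></objects> },
  document {
    <mappings>
      <mapping object-id="o1"/>
      <mapping object-id="o9"/>
    </mappings>
  }
);

(: Report every mapping whose object-id matches no object/@id in any document. :)
declare function local:dangling($data as node()*) as element(error)* {
  let $ids := $data//object/@id
  for $mapping in $data//mapping[not(@object-id = $ids)]
  return
    <error object-id="{ $mapping/@object-id }">Trying to map invalid object id</error>
};

local:dangling($docs)
```

Here only the mapping to "o9" is reported, because "o1" is defined in another 
document of the same sequence. In BaseX the sequence would come from the 
database (for example via db:open()) instead of being written inline.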

Of course, in practice I am not sure if this can be done, and I am pretty new 
to all of this. I see that currently schematron:validate requires a node as 
input. I presume that db:open() will give me a sequence of document nodes. What 
I presume would work is if I could somehow turn this sequence into a single 
document node. I am not sure if this can be done easily, or efficiently, in 
XQuery, or whether it would be easier to implement it within BaseX's 
implementation of db:open, or whether this is not really feasible at all ...

(With that working a similar line of thought would apply to schema 

[basex-talk] How many QueryModule instances can be created?

2019-12-17 Thread Johannes Echterhoff
Hello,
We are struggling with a situation where we have built a custom Java 
QueryModule (as described in http://docs.basex.org/wiki/Java_Bindings), let's 
call it M1, and that module is imported by two other XQuery modules (let's call 
them M2 and M3). When executing an XQuery that imports M1 and the other XQuery 
modules M2 and M3 within the BaseX GUI, multiple different instances of M1 are 
created. That is an issue, because we initialize the module M1 with some 
information (a custom geometry index) before calling any other function of M1. 
This initialization should occur only once, because it is quite time-consuming. 
Also, the information from this initialization procedure must be available 
whenever the (non-initialization) functions of M1 are called, whether in the 
main query or in the other modules M2 and M3.
The situation is as follows:

* testquery.xq imports the custom module M1
* testquery.xq also imports the XQuery modules M2 and M3, which are 
  stored in other XQuery files; M2 and M3 both import M1
* the custom module is initialized in testquery.xq - not in M2 and M3
* non-initialization functions of M1 are called in testquery.xq, as 
  well as by functions declared in M2 and M3 - but only after initialization 
  functions of M1 have been called in testquery.xq

We get exceptions when functions of M1 are called by functions declared in M2 
and M3, apparently because multiple instances of M1 have been created by BaseX 
(tested with the BaseX GUI), and only one of them has been initialized (in 
testquery.xq).
After this lengthy introduction of the issue at hand, here are my questions:

* Is it expected behavior that multiple instances of a Java QueryModule 
  (M1, in my scenario) may be created and used during an execution of a query 
  scenario like the one described above - or in general? Or should there only 
  ever be a single such instance, regardless of how many times the module is 
  imported?
  o Note: The functions used to initialize M1 are non-deterministic and 
    declared as such. Not entirely sure if that makes a difference regarding 
    how many times M1 would be created.
* I have the same questions for the case that the Java QueryModule M1 
  was packaged in a XAR (as described in 
  http://docs.basex.org/wiki/Repository#EXPath_Packaging). Would such a 
  packaging approach actually make any difference?

Apologies that I do not have a small, self-contained example project to 
demonstrate this. I hope that I explained the issue with sufficient detail and 
clarity. If not, just let me know.

Best regards,
Johannes

P.S.: If you have suggestions for a better approach of handling such a 
scenario, where a QueryModule must be initialized before it can be used, and 
there shall only be a single instance of this module within the execution of an 
XQuery, let me know.



Re: [basex-talk] Huge No of XML files.

2019-12-17 Thread Liam R. E. Quin
On Tue, 2019-12-17 at 11:48 +0530, Sreenivasulu Yadavalli wrote:
> 
> Every day we are moving collections around 55k to 60k no of xml files
> large
> account.


Here, i just created a BaseX database with 80,000 XML files. It took
under one minute on the Linux desktop system i use.

>  Its taking more than 18 hours.
This makes no sense. How much memory do you have on the computer?

What exactly do you mean by moving collections around?

Are you taking a database with 100 million documents and renaming
50,000 of them?

What operations exactly are slow?

Liam

-- 
Liam Quin, https://www.delightfulcomputing.com/
Available for XML/Document/Information Architecture/XSLT/
XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
Barefoot Web-slave, antique illustrations:  http://www.fromoldbooks.org