Re: [basex-talk] Basex Inner Workings

2017-09-18 Thread Anastasiou A .
Hello Fabrice

Given:


[a number][a string]
…
   [a number][a string]


The result contains about 5 million `item` elements (depending on the query, we 
are talking about a few million items).

If I don’t add the enclosing structure, which here is defined by the `item` and 
`items` elements, the “unformatted” output is generated in under 2 seconds:

[a number][a string]
…
[a number][a string]

Again, that would be a few million items. Queries are exactly the same apart 
from the addition of element items{ for… for… return element item {…}}

The problem with this second representation is that you don’t really know where 
the tags belonging to one item of the original database begin and end, which is 
why I want to enclose them further.
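
To make the ambiguity concrete, here is a minimal sketch on inline sample data. The element names `record`, `id`, `entry` and `name` are hypothetical stand-ins for the real schema, which is not shown in this thread; only `items` and `item` come from the query described above.

```xquery
(: Hypothetical sample data; replace with a path into the real database. :)
let $db :=
  <root>
    <record><id>1</id><entry><name>a</name></entry><entry><name>b</name></entry></record>
    <record><id>2</id><entry><name>c</name></entry></record>
  </root>
return (
  (: Flat form: a bare sequence of nodes, nothing marks where one pair ends. :)
  for $r in $db/record
  for $e in $r/entry
  return ($r/id | $e/name),

  (: Enclosed form: every pair is wrapped in its own item element. :)
  element items {
    for $r in $db/record
    for $e in $r/entry
    return element item { $r/id | $e/name }
  }
)
```

In the flat form a consumer has to guess that every `id` starts a new item; in the enclosed form the boundaries are explicit.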

All the best



From: basex-talk-boun...@mailman.uni-konstanz.de 
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On Behalf Of Fabrice 
ETANCHAUD
Sent: 18 September 2017 15:56
To: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Basex Inner Workings

Hi Athanasios,

Could you please give us an idea of your resulting document size after 1.5 
minutes of BaseX time?

Best regards,
Fabrice


Re: [basex-talk] Basex Inner Workings

2017-09-18 Thread Anastasiou A .
Must be the day today, sorry, please see below:

No, but it did not make any difference.

But I will tell you what did make a difference: forcing everything to be a 
string and hard-coding the names of the tags. That’s a ~3-4 sec query to return 
~5 million items.

I was led to this by what you said about computed elements because it makes 
perfect sense if BaseX has to create the document it returns, in memory, as a 
“proper” XML tree data structure.

I am not particularly jumping up and down about this but it works for the 
moment for such a simple use case. It’s not best practice though so I would be 
more inclined to use the right way of speeding this query up if possible.
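
For reference, the string-based workaround could look roughly like the sketch below. The sample data, the element names other than `items` and `item`, and the helper `local:escape` are assumptions made for illustration; the actual query is not shown in this thread.

```xquery
declare namespace output = 'http://www.w3.org/2010/xquery-serialization';
(: Serialize the strings as plain text so that the tags are not escaped. :)
declare option output:method 'text';

(: Hypothetical helper: strings bypass XML escaping, so it is done by hand. :)
declare function local:escape($s as xs:string) as xs:string {
  replace(replace($s, '&amp;', '&amp;amp;'), '<', '&amp;lt;')
};

(: Hypothetical sample data; replace with a path into the real database. :)
let $db :=
  <root>
    <record><id>1</id><entry><name>a</name></entry></record>
    <record><id>2</id><entry><name>b &amp; c</name></entry></record>
  </root>
return string-join(
  (
    '<items>',
    for $r in $db/record
    for $e in $r/entry
    return '<item><id>' || local:escape($r/id) || '</id><name>'
           || local:escape($e/name) || '</name></item>',
    '</items>'
  ),
  '&#10;'
)
```

No new element nodes are built here, only strings, which is presumably why it is fast. The price is that well-formedness of the output is now the query author's responsibility.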

By the way, there are no “computed” (in the sense of derived) fields in this 
query, in case you meant it that way.

All the best






From: basex-talk-boun...@mailman.uni-konstanz.de 
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On Behalf Of Graydon 
Saunders
Sent: 18 September 2017 14:01
To: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Basex Inner Workings

Sorry for the fumble-fingers; let me try that again.

Have you tried creating literal elements?

Computed elements have overhead; it's presumptively akin to why you don't want 
to create untyped variables in XSLT 2.0 and 3.0 (an untyped variable might be 
anything and needs a whole document node to exist in, and this is expensive).  
In this case, I'd be darkly suspicious the computed elements are computing 
their contents every time.

I'd be trying
for ...
let $elem1 as element() := ...
let $elem2 as element() := ...
return <item>{$elem1,$elem2}</item>

instead of the computed element.
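
A runnable version of this suggestion, as a minimal sketch on inline sample data: the element names `record`, `id`, `entry` and `name` are hypothetical, and whether this form is actually faster on the real database would need to be measured.

```xquery
(: Hypothetical sample data; replace with a path into the real database. :)
let $db :=
  <root>
    <record><id>1</id><entry><name>a</name></entry></record>
    <record><id>2</id><entry><name>b</name></entry></record>
  </root>
return
  (: Direct (literal) element constructors instead of `element items { ... }`. :)
  <items>{
    for $r in $db/record
    for $e in $r/entry
    (: Typed variables: each binding must be exactly one element. :)
    let $elem1 as element() := $r/id
    let $elem2 as element() := $e/name
    return <item>{ $elem1, $elem2 }</item>
  }</items>
```

Note that `as element()` raises a type error if a path selects nothing or more than one node, so the declarations also act as a check on the data.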

The optimizer is really good in BaseX but it's also really complicated; the 
local maxima can be quite narrow.



Re: [basex-talk] Basex Inner Workings

2017-09-18 Thread Anastasiou A .
No, but it did not make any difference.

But I will tell you what did make a difference, forcing everything to be a 
string and hard coding the names of the tags. That’s a ~3-4 sec query to return 
~5 million items.

I was led to this by what you said about computed elements because it makes 
perfect sense if BaseX
has to create the “document” it returns, in memory, as a prop





Re: [basex-talk] Basex Inner Workings

2017-09-18 Thread Anastasiou A .
Hello

Many thanks, Dirk, Fabrice and Graydon.

I was going to look up ways of enabling the server to run as fast as possible 
anyway later on, so it is always good to know how BaseX is “thinking”.

I can see what you mean Graydon. This is a simple nested `for` to denormalise 
some of the structures of the XML file, where “some” is defined by
an XPath expression.

As far as I can tell, there is nothing being re-evaluated repeatedly within the 
inner loop that could be brought outside.

I have gone through the dot plans of the quickest and slowest versions of the 
query and the only difference between them is the addition of the CElems.

The “scaling” of the timings, in case it helps, is as follows:

Simple query, returning elements: 1100-1500 ms

Adding an `element` to what is returned just by the innermost `for`: 7500-9311 ms
This means:
for …
  for …
  return element item { someElement | someOtherElement }

Adding an `element` to the whole block (no `element` on the innermost `for`): 49000-67000 ms
This means:
element items {
  for …
  for …
  return someElement | someOtherElement
}

Adding an `element` to both places: 5-8 ms
This means:
element items {
  for …
  for …
  return element item { someElement | someOtherElement }
}
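
The two intermediate variants can be written out as follows. This is a sketch on hypothetical inline data (the names `record`, `id`, `entry` and `name` are placeholders), intended only to show what each variant constructs, not to reproduce the timings.

```xquery
(: Hypothetical sample data; replace with a path into the real database. :)
let $db :=
  <root>
    <record><id>1</id><entry><name>a</name></entry></record>
    <record><id>2</id><entry><name>b</name></entry></record>
  </root>
return (
  (: Constructor only around what the innermost `for` returns:
     yields a sequence of small, independent item elements. :)
  for $r in $db/record
  for $e in $r/entry
  return element item { $r/id | $e/name },

  (: Constructor only around the whole block:
     yields one large element that holds every returned node. :)
  element items {
    for $r in $db/record
    for $e in $r/entry
    return $r/id | $e/name
  }
)
```

In XQuery an element constructor copies the nodes placed in its content, so both variants build new in-memory trees rather than returning references to database nodes. That may account for part of the extra cost, although it does not by itself explain the differences between the variants.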


I don’t mind the ~8sec time but when we get to 1.5min, then yes…that’s going to 
be a bit annoying.

All the best







From: Graydon Saunders [mailto:graydon...@gmail.com]
Sent: 15 September 2017 17:04
To: Anastasiou A.; basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Basex Inner Workings

As a follow-on to Dirk, it's amazing how much of a performance difference it 
can make to use typed variables when you're constructing something for output.  
(So far as I can tell, variable declarations function as an "optimize this!" 
flag for BaseX.)

If you get good performance when you're just throwing the resulting nodes and 
lose it massively by adding structure, as you relate up there somewhere:

> The change was to go from simply returning the nodes themselves with a `return 
> thisnode | thatnode |theothernode` to a "formatted" document that has an outer 
> `<collection>` with a number of `return 
> <item>{thisNode|thatNode|theOtherNode}</item>` inside it.

my immediate thought was "it's querying the same thing multiple times".

In most programming languages it's good practice to not create variables when you 
can inline.  XQuery does not appear to be one of those languages. :)  I try to 
think of this as "how can I make things easy for the optimizer?"

-- Graydon

On Fri, Sep 15, 2017 at 11:55 AM, Kirsten, Dirk <dirk.kirs...@senacor.com> wrote:
Hello Athanasios,

I think you should really check the actual query plan which is executed. If you 
see such a huge jump in run time, surely the processor will be executing 
it differently. I don't think looking into the file access patterns BaseX 
uses internally is very useful for an end user. You should let BaseX handle 
that (but of course, if you find better or more efficient ways, I am sure 
Christian will gladly accept pull requests). The pattern you describe sounds 
very much as expected: reads when you open databases seem logical, and short 
write operations are also expected when just reading a database because, for 
example, BaseX has to lock the databases.

So I think it would be more useful to look into the query plan. Of course you 
are more than welcome to ask about what is going on there on this list. I would 
expect that because of your rewrite some indexes may no longer be applied 
(or, if the result of your rewrite is simply very big, that most of the time is 
spent serializing the data).

Cheers
Dirk



-Original Message-
From: basex-talk-boun...@mailman.uni-konstanz.de 
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On Behalf Of Fabrice 
ETANCHAUD
Sent: Friday, 15 September 2017 17:35
To: 'Anastasiou A.' <a.anastas...@swansea.ac.uk>; 
basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Basex Inner Workings


You can find the time spent in each step in the query info bar graph.

If you are looking for the schema and the facets of your dataset, you should 
have a look at the Index Module, and certainly at index:facets().
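
As a minimal sketch, the call is a one-liner. The database name `mydb` is a placeholder, and the query has to be run in BaseX itself because `index:facets` belongs to the BaseX Index Module and is not standard XQuery.

```xquery
(: 'mydb' is a hypothetical database name; use the name of your own database. :)
index:facets('mydb')
```

The result is a summary of the element and attribute paths in the database together with statistics about their values, which gives a quick overview of the structure before writing longer queries.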

Best regards,
Fabrice


Re: [basex-talk] Basex Inner Workings

2017-09-15 Thread Anastasiou A .
Thank you Fabrice. I understand.

I have not tried querying from the command prompt or sending the output 
directly to a file, which I could also work with. But my understanding is that 
the time quoted by the GUI is the DB time, not taking into account the 
time it takes for the list to be pushed into whatever data structures the list 
boxes might be using (?).

I am trying to get a better understanding of the dataset at the moment, and I 
have short and long queries which, depending on the results I get from this 
step, could be optimised further.

All the best

-Original Message-
From: Fabrice ETANCHAUD [mailto:fetanch...@pch.cerfrance.fr] 
Sent: 15 September 2017 16:17
To: Anastasiou A.; basex-talk@mailman.uni-konstanz.de
Subject: RE: Basex Inner Workings

I understand that you are reformatting a lot of data, is that right?
I have only a little advice, because this is not my use case.

From what I know, the resulting document will be materialized entirely in memory 
before presentation or export.
You should export your results to disk, so as not to lose time in BaseXGUI 
rendering.
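
One way to do that from within the query is the File Module. This is a minimal sketch: the file name `items.xml` and the generated sample result are placeholders, and `file:write` is part of the EXPath File Module that BaseX implements.

```xquery
(: Placeholder result; in practice this would be the real query. :)
let $result :=
  <items>{
    for $i in 1 to 3
    return <item>{ $i }</item>
  }</items>
(: Write the serialized result to disk instead of rendering it in the GUI. :)
return file:write('items.xml', $result)
```

Running the query with the standalone `basex` command and its `-o` option to redirect the output to a file should achieve the same without changing the query.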

To reformat very big amounts of data, you might have a look at Saxon's streaming 
features (not in the free version).

But usually, big results are not requested frequently.

Best regards,
Fabrice


Re: [basex-talk] Basex Inner Workings

2017-09-15 Thread Anastasiou A .
Hello Fabrice

Yes, I have a query which jumped from ~1500 ms to about a minute with a 
tiny little change...

The DB is about 2GB and it is my test set before putting the query to work on 
the full dataset.

The change was to go from simply returning the nodes themselves with a `return 
thisnode | thatnode |theothernode` to a "formatted" document that has an outer 
`<collection>` with a number of `return 
<item>{thisNode|thatNode|theOtherNode}</item>` inside it.

I understand that the new query might be creating some new entities but 
compared to the element content, these few extra characters are not THAT many 
more.

The query jumps from ~1500 ms when using plain XML, to ~55000 ms with the 
addition of the collection and item nodes, to ~57000 ms with the addition of CSV 
exporting via the CSV module. These are "informal average" values: I have 
not run the same query several times and averaged the results, but that is the 
sort of vicinity the numbers have been in whenever I have run the queries 
so far.
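
For readers unfamiliar with the CSV step, a minimal sketch of exporting such a structure with the BaseX CSV Module follows. The sample data and the field names `id` and `name` are hypothetical, and the `header` option is assumed to behave as described in the module documentation; the actual export query is not shown in this thread.

```xquery
(: Hypothetical sample in the "direct" layout of the CSV Module:
   one record element per row, one child element per field. :)
let $csv :=
  <csv>
    <record><id>1</id><name>a</name></record>
    <record><id>2</id><name>b</name></record>
  </csv>
(: 'header': true() asks for a header line built from the field names. :)
return csv:serialize($csv, map { 'header': true() })
```

The input to `csv:serialize` is itself a constructed XML tree, so this step builds on the "formatted" document rather than replacing it, which is consistent with it adding only a little time on top.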

The database itself is "static", there are no update/insert transactions at the 
moment, the only thing that I am trying to do is extract some data in a 
different format from it.

I have Text, Attribute and Token indexes on that database (optimised right 
after importing) but no further options enabled. I also have not experimented 
with the SPLITSIZE (?). I have 32GB of memory and it should be enough to handle 
this 2GB test dataset (?). I will have a go with DEBUG on.

Did you have to enable any additional options for indexes to work faster?

All the best





-Original Message-
From: basex-talk-boun...@mailman.uni-konstanz.de 
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On Behalf Of Fabrice 
ETANCHAUD
Sent: 15 September 2017 13:27
To: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Basex Inner Workings

Hi Athanasios,

Did you experience slow queries?
Are you sure you are using all the index features?
Are these queries operational ones (direct access on a key value) or analytics?

I have never experienced slow queries, even on a huge XML corpus (patent 
registrations), but this comes at the cost of longer indexing times on updates.

Best regards,



Re: [basex-talk] Basex Inner Workings

2017-09-15 Thread Anastasiou A .
Hello Alexander

The thesis is a fantastic resource for getting to know a bit more about BaseX's 
inner workings, thank you very much.

I had seen the Storage_Layout page already, but I was trying to understand if 
there is anything that can be done at the file system 
level. This was also because I read that parallel operations could result in 
access patterns that cannot be handled efficiently by caching (which is a very 
good point anyway).

All the best




-Original Message-
From: Alexander Holupirek [mailto:a...@holupirek.de] 
Sent: 15 September 2017 13:56
To: Anastasiou A.
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Basex Inner Workings


> On 15. Sep 2017, at 14:00, Anastasiou A. <a.anastas...@swansea.ac.uk> wrote:
> Quick question: Is there any document / URL where I could find out more about 
> how does Basex access the disk during its operation?
> 
> For example, are there any reads to be expected during executing a query?

You can have a look at Christian's dissertation:

http://files.basex.org/publications/Gruen%20[2010],%20Storing%20and%20Querying%20Large%20XML%20Instances.pdf

That way you can at least get a picture of the inner organisation of the 
storage system and may deduce some access patterns?

http://docs.basex.org/wiki/Storage_Layout may help as well?



[basex-talk] Basex Inner Workings

2017-09-15 Thread Anastasiou A .
Hello everyone

Quick question: Is there any document / URL where I could find out more about 
how BaseX accesses the disk during its operation?

For example, are there any reads to be expected while executing a query?

Through iotop, I can see 3-4 processes reading during startup, then another 2 
firing very briefly when opening the database, and then 
during querying there are periodic writes (?) but of very brief duration.

I was wondering if there is anything that could be done from the point of view 
of the hardware to speed up queries (?) (except a more powerful machine at the 
moment)

All  the best
Athanasios Anastasiou


Re: [basex-talk] Possible Bug in BaseX 8.2.3 when importing XML (Was RE: A few general questions about BaseX)

2017-09-14 Thread Anastasiou A .
Hello Fabrice

That’s brilliant, thank you very much, I will keep it in mind for future 
reference.

No, I did not set DEBUG, and yes, it was directory content.

Once I find some time, I am going to run the “offending” import again with DEBUG
and send some more information in case this is indeed a bug. But, I have to say,
it may be that the DB was hitting one of its natural limits, which is fine.

All the best



From: basex-talk-boun...@mailman.uni-konstanz.de 
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On Behalf Of Fabrice 
ETANCHAUD
Sent: 14 September 2017 09:26
To: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Possible Bug in BaseX 8.2.3 when importing XML (Was 
RE: A few general questions about BaseX)

Hi Athanasios,

Did you set the DEBUG option to get detailed information?

Could you confirm that you are creating a db from the contents of a directory?
If this is the case, as suggested, you should generate a command script that 
forces the loading order, and use this script to load the data so that you can 
detect where it fails.
You can easily create such a bxs file in XQuery with a `for` loop over file:list().

It should look like this:





myphysicalpath
myphysicalpath

..
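
A generator for such a command script could be sketched as follows. The directory `/path/to/xml/`, the database name `mydb` and the output file `load.bxs` are placeholders; the `<commands>`, `<create-db>` and `<add>` elements follow the XML syntax of BaseX command scripts.

```xquery
(: Hypothetical input directory; adjust to the real location of the files. :)
let $dir := '/path/to/xml/'
let $script :=
  <commands>
    <create-db name='mydb'/>
    {
      (: List the XML files non-recursively and fix the loading order by name. :)
      for $name in file:list($dir, false(), '*.xml')
      order by $name
      return <add path='{ $name }'>{ $dir || $name }</add>
    }
  </commands>
(: Write the command script to disk so that it can be run with BaseX. :)
return file:write('load.bxs', $script)
```

Because the `add` commands are executed in the order in which they are written, the last file named before the failure identifies where the import breaks.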



Best regards,
Fabrice Etanchaud

From: Anastasiou A. [mailto:a.anastas...@swansea.ac.uk]
Sent: Wednesday, 13 September 2017 11:23
To: basex-talk@mailman.uni-konstanz.de
Cc: 'Alexander Holupirek'; 'Michael Seiferle'; Fabrice ETANCHAUD; 'Bridger 
Dyson-Smith'
Subject: Possible Bug in BaseX 8.2.3 when importing XML (Was RE: [basex-talk] A 
few general questions about BaseX)

Hello everyone

Many thanks to Alexander, Bridger, Fabrice, Michael for getting back to me with 
very detailed responses, these have been really helpful.

A few notes:


1)  The name is Athanasios :D. Sorry, just couldn’t help it, it seemed 
incredibly formal to be addressed via the surname in our communications.
Our mail server advertises the “Surname. Initial” pattern, so I can see where 
the confusion came from.

2)  I think that there is scope for adding some sort of “logging” to all 
actions of the server in general because I think I may have hit a bug but I 
cannot
provide any more illuminating comments. Here is what is happening:

a.  During import, I get an error that file somethingsomething140.xml has 
an incredibly long element that cannot be imported at line (blahblah). The 
whole process just dies there.

b.  This is a bug, because if I simply imported JUST the offending file 
itself, a single file database is created without any problems and I can query 
it and all. So, maybe, the error is caused because of the previous file OR 
because of the way the files are loaded. But I have absolutely no way of 
knowing the “load history” of the files or the exception that was caught or 
anything else. In fact, once you press “OK” in the error dialog box, any 
database files that have been created are lost. In addition to this, the XML 
files to import are enumerated in a random order. So, I had to run the import 
again and stay there looking at each one of the files loading, to witness that 
the system “breaks” after 254 files (which is suspiciously close to 256). None 
of the files around the vicinity of the offending file caused any problems, so 
this may be a more difficult to catch bug (but it is thrown with both the 
internal and external parsers). Following this, I created smaller databases 
with 250 XML files and then got “predictable” errors on running out of memory 
and not creating indexes which I can solve more easily.

3)  It’s good to know that I don’t need the original files because that’s a 
lot of space I can get rid of. Thank you.

4)  Seems like the ADDCACHE would have saved me some trouble here, many 
thanks for that, but of course, if you don’t know the file enumeration order, 
you are still stuck in not knowing which files have already been imported.

5)  Michael, logging won’t help with the internal import procedure, except 
of course if you were implying writing a quick script to do the import 
“manually”?

6)  Michael, the fork-join and “client connect” are really interesting and 
worth a try before I start connecting things together via Hadoop. Are these 
modules already available to BaseX? Do I simply import their namespace or is it 
not even needed?

Many thanks again.

All the best






From: Bridger Dyson-Smith [mailto:bdysonsm...@gmail.com]
Sent: 12 September 2017 16:53
To: Anastasiou A.
Cc: 
basex-talk@mailman.uni-konstanz.de<mailto:basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] A few general questions about BaseX

Hi Anastasiou,
Hopefully some of these answers are somewhat helpful.

On Tue, Sep 12, 2017 at 4:54 AM, Anastasiou A. 
<a.anastas...@swansea.ac.uk<mailto:a.anastas...@swansea.ac.uk>> wrote:
Hello everyone

I am trying to load BaseX with a large n

[basex-talk] Possible Bug in BaseX 8.2.3 when importing XML (Was RE: A few general questions about BaseX)

2017-09-13 Thread Anastasiou A .
Hello everyone

Many thanks to Alexander, Bridger, Fabrice and Michael for getting back to me 
with very detailed responses; these have been really helpful.

A few notes:


1)  The name is Athanasios :D. Sorry, just couldn’t help it, it seemed 
incredibly formal to be addressed via the surname in our communications.
Our mail server advertises the “Surname. Initial” pattern, so I can see where 
the confusion came from.


2)  I think there is scope for adding some sort of “logging” to all actions of 
the server in general, because I think I may have hit a bug but cannot provide 
any more illuminating detail. Here is what is happening:

a.  During import, I get an error that file somethingsomething140.xml has 
an incredibly long element that cannot be imported at line (blahblah). The 
whole process just dies there.

b.  This is a bug, because if I import JUST the offending file by itself, a 
single-file database is created without any problems and I can query it. So 
the error may be caused by the previous file OR by the way the files are 
loaded. But I have absolutely no way of knowing the “load history” of the 
files, the exception that was caught, or anything else. In fact, once you press 
“OK” in the error dialog box, any database files that have been created are 
lost. In addition, the XML files to import are enumerated in a random order. So 
I had to run the import again and watch each file load, to witness that the 
system “breaks” after 254 files (which is suspiciously close to 256). None of 
the files in the vicinity of the offending file caused any problems, so this 
may be a bug that is more difficult to catch (it is thrown with both the 
internal and external parsers). Following this, I created smaller databases 
with 250 XML files each and then got “predictable” errors about running out of 
memory and indexes not being created, which I can solve more easily.


3)  It’s good to know that I don’t need the original files because that’s a 
lot of space I can get rid of. Thank you.


4)  It seems that the ADDCACHE option would have saved me some trouble here; 
many thanks for that. Of course, if you don’t know the file enumeration order, 
you still cannot tell which files have already been imported.


5)  Michael, logging won’t help with the internal import procedure, unless of 
course you were suggesting that I write a quick script to do the import 
“manually”?


6)  Michael, the fork-join and “client connect” are really interesting and 
worth a try before I start connecting things together via Hadoop. Are these 
modules already available to BaseX? Do I simply import their namespace or is it 
not even needed?
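
To make the workaround in point 2b concrete, here is a minimal sketch of splitting the files into batches of 250 in a fixed, sorted order, so that each batch can go into its own smaller database and the load order is always known. It is written in Python only for illustration; the function name, the `part` prefix and the directory argument are hypothetical, and nothing here calls BaseX itself.

```python
from pathlib import Path


def plan_batches(xml_dir, batch_size=250, db_prefix="part"):
    """Map hypothetical database names to sorted batches of XML files."""
    # A sorted listing replaces the random enumeration order, so it is
    # always clear which files come before a failing one.
    files = sorted(Path(xml_dir).glob("*.xml"))
    plan = {}
    for start in range(0, len(files), batch_size):
        # Database names are numbered by batch: part000, part001, ...
        db_name = f"{db_prefix}{start // batch_size:03d}"
        plan[db_name] = files[start:start + batch_size]
    return plan
```

Each entry of the returned plan can then be imported as a separate database, for example from a short import script.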

Many thanks again.

All the best






From: Bridger Dyson-Smith [mailto:bdysonsm...@gmail.com]
Sent: 12 September 2017 16:53
To: Anastasiou A.
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] A few general questions about BaseX

Hi Anastasiou,
Hopefully some of these answers are somewhat helpful.

On Tue, Sep 12, 2017 at 4:54 AM, Anastasiou A. 
<a.anastas...@swansea.ac.uk<mailto:a.anastas...@swansea.ac.uk>> wrote:
Hello everyone

I am trying to load BaseX with a large number of XML files (~500), each a few 
hundred MB in size.
BaseX fails with a message along the lines of “This is too big for one database”.

Can I please ask:


1)  Are there any logs, beyond the DB logs? If yes, where can I find them?

a.  The reason I am asking is that once basexgui gives the message, there is 
no further indication about the error.
Ideally, I would like to know if this is a limitation on memory amount or 
number of items (?).

I'm not sure how to enable more verbose logging with the GUI -- hopefully one 
of the devs or power users can weigh in on this.

2)  The parser options include reading XML files from archives, which is 
very convenient, but once the file has been
parsed, does BaseX require the “originals” for queries / returning results?

AFAIK, no, it does not. BaseX will query and return results from the internal 
database(s).

3)  Is it possible to do federation with BaseX? In other words, let’s say I 
split a database in two large parts (as per #1),
is it possible to launch two BaseX servers and then have them talk to each 
other so that ultimately I just query one of
them and get back unified results?

AFAIK, the preferred method is to split your files across many databases, then 
query multiple databases from a single expression[1]. Others will be able to 
speak to this better, but I don't think there's a straightforward way to run 
multiple BaseX servers in a single JVM.
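
As a small illustration of "query multiple databases from a single expression", the sketch below assembles an XQuery string that loops over several database names and evaluates the same path in each. Python is used only to build the string; the helper name, the database names and the `//item` path are hypothetical, and `db:open` is assumed to be the Database Module function available in the BaseX 8.x line discussed here (it may be named differently in other versions).

```python
def multi_db_query(db_names, path):
    """Build an XQuery string that evaluates `path` in each database."""
    # Quote names as XQuery string literals (embedded quotes are doubled).
    literals = ", ".join('"' + n.replace('"', '""') + '"' for n in db_names)
    return f"for $db in ({literals}) return db:open($db){path}"


# Hypothetical database names and path, for illustration only.
print(multi_db_query(["part000", "part001"], "//item"))
```

The resulting expression can be run against a single BaseX instance that holds all the databases, which gives unified results without a second server.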


All the best

Best,
Bridger

[1] http://docs.basex.org/wiki/Databases


[basex-talk] FW: A few general questions about BaseX

2017-09-12 Thread Anastasiou A .
I am sorry, it turns out the error is probably due to malformed input in one of 
the files, not BaseX; I will have to look into it. I would, however, still 
appreciate some indication regarding the rest of the questions.

All the best





[basex-talk] A few general questions about BaseX

2017-09-12 Thread Anastasiou A .
Hello everyone

I am trying to load BaseX with a large number of XML files (~500), each a few 
hundred MB in size.
BaseX fails with a message along the lines of "This is too big for one database".

Can I please ask:


1)  Are there any logs, beyond the DB logs? If yes, where can I find them?

a.  The reason I am asking is that once basexgui gives the message, there is 
no further indication about the error.
Ideally, I would like to know if this is a limitation on memory amount or 
number of items (?).


2)  The parser options include reading XML files from archives, which is 
very convenient, but once the file has been
parsed, does BaseX require the "originals" for queries / returning results?


3)  Is it possible to do federation with BaseX? In other words, let's say I 
split a database in two large parts (as per #1),
is it possible to launch two BaseX servers and then have them talk to each 
other so that ultimately I just query one of
them and get back unified results?

All the best