Re: [basex-talk] Basex Inner Workings
Hello Fabrice

Given:

[a number][a string] … [a number][a string]

The size is ~5m `item`. (Depending on the query, we are talking about a few million items.)

If I don’t add any additional external structure, which here is defined by the `item` and `items` elements, then the “unformatted” output is generated in under 2 sec:

[a number][a string] … [a number][a string]

Again, that would be a few million items.

The queries are exactly the same apart from the addition of `element items{ for… for… return element item {…}}`.

The problem with this second representation is that you don’t really know where the tags from one item of the original database begin and end, which is why I want to enclose them further.

All the best

From: basex-talk-boun...@mailman.uni-konstanz.de [mailto:basex-talk-boun...@mailman.uni-konstanz.de] On Behalf Of Fabrice ETANCHAUD
Sent: 18 September 2017 15:56
To: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Basex Inner Workings

Hi Athanasios,

Could you please give us an idea of your resulting document size after 1.5 minutes of BaseX time?

Best regards,
Fabrice
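The two output shapes discussed in this message can be sketched roughly as follows. This is only an illustration: the `records`, `record`, `id`, `parts` and `part` names and the inline data are assumptions standing in for the real database, which the thread does not show.

```xquery
(: Hypothetical stand-in for the real database; all names are assumptions. :)
let $db :=
  <records>
    <record><id>1</id><parts><part>a</part><part>b</part></parts></record>
    <record><id>2</id><parts><part>c</part></parts></record>
  </records>
return (
  (: "Unformatted": a flat sequence of existing nodes.
     Nothing marks where one denormalised item ends and the next begins. :)
  for $r in $db/record
  for $p in $r/parts/part
  return ($r/id | $p),

  (: "Formatted": every id/part pair is enclosed in its own item element,
     and all items sit under a single items root. :)
  element items {
    for $r in $db/record
    for $p in $r/parts/part
    return element item { $r/id | $p }
  }
)
```

The first expression only returns references to nodes that already exist, whereas the second has to build new elements and copy the selected nodes into them.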
Re: [basex-talk] Basex Inner Workings
Hi Athanasios,

Could you please give us an idea of your resulting document size after 1.5 minutes of BaseX time?

Best regards,
Fabrice

From: basex-talk-boun...@mailman.uni-konstanz.de [mailto:basex-talk-boun...@mailman.uni-konstanz.de] On Behalf Of Anastasiou A.
Sent: Monday 18 September 2017 14:47
To: 'Graydon Saunders'; basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Basex Inner Workings

Hello

Many thanks, Dirk, Fabrice and Graydon.

I was going to look up ways of enabling the server to run as fast as possible anyway later on, so it is always good to know how BaseX is “thinking”.

I can see what you mean Graydon. This is a simple nested `for` to denormalise some of the structures of the XML file, where “some” is defined by an XPath expression.

As far as I can tell, there is nothing being re-evaluated repeatedly within the inner loop that could be brought outside.

I have gone through the dot plans of the quickest and slowest versions of the query and the only thing they differ in is the addition of the CElems.

The “scaling” of the timings, in case it helps, is as follows:

Simple query, returning elements: 1100-1500 ms

Adding an `element` to what is returned just by the innermost `for`: 7500-9311 ms
This means:
For…
  For…
    Return element item{someElement|someOtherElement}

Adding an `element` to the whole block (no `element` to the innermost `for`): 49000-67000 ms
This means:
Element Items{
  For…
    For…
      Return someElement|someOtherElement
}

Adding an `element` to both places: 5-8ms
This means:
Element Items{
  For…
    For…
      Return element Item {someElement|someOtherElement}
}

I don’t mind the ~8 sec time, but when we get to 1.5 min, then yes… that’s going to be a bit annoying.

All the best
Re: [basex-talk] Basex Inner Workings
Must be the day today, sorry, please see below:

No, but it did not make any difference.

But I will tell you what did make a difference: forcing everything to be a string and hard-coding the names of the tags. That’s a ~3-4 sec query to return ~5 million items.

I was led to this by what you said about computed elements, because it makes perfect sense if BaseX has to create the document it returns, in memory, as a “proper” XML tree data structure.

I am not particularly jumping up and down about this, but it works for the moment for such a simple use case. It’s not best practice though, so I would be more inclined to use the right way of speeding this query up if possible.

By the way, there are no “computed” (in the sense of derived) fields in this query, in case you meant it that way.

All the best

From: Anastasiou A.
Sent: 18 September 2017 14:29
To: 'Graydon Saunders'; basex-talk@mailman.uni-konstanz.de
Subject: RE: [basex-talk] Basex Inner Workings

No, but it did not make any difference.

But I will tell you what did make a difference: forcing everything to be a string and hard-coding the names of the tags. That’s a ~3-4 sec query to return ~5 million items.

I was led to this by what you said about computed elements because it makes perfect sense if BaseX has to create the “document” it returns, in memory, as a prop
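The string-based workaround described in this message (forcing values to strings and hard-coding the tag names) might look something like the sketch below. The element names and inline data are assumptions for illustration, not the actual query, which is not shown in the thread.

```xquery
declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
(: Serialise the result as plain text so the hand-written tags are not escaped. :)
declare option output:method "text";

(: Hypothetical stand-in for the real database; all names are assumptions. :)
let $db :=
  <records>
    <record><id>1</id><parts><part>a</part><part>b</part></parts></record>
    <record><id>2</id><parts><part>c</part></parts></record>
  </records>
return string-join((
  "<items>",
  for $r in $db/record
  for $p in $r/parts/part
  (: No element is constructed here: each item is just a concatenated string. :)
  return "<item><id>" || $r/id || "</id><part>" || $p || "</part></item>",
  "</items>"
), "&#10;")
```

As the message itself says, this is not best practice: values containing `<` or `&` are not escaped, so the output is only well-formed XML if the data happens to be free of such characters.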
Re: [basex-talk] Basex Inner Workings
No, but it did not make any difference.

But I will tell you what did make a difference: forcing everything to be a string and hard-coding the names of the tags. That’s a ~3-4 sec query to return ~5 million items.

I was led to this by what you said about computed elements because it makes perfect sense if BaseX has to create the “document” it returns, in memory, as a prop

From: basex-talk-boun...@mailman.uni-konstanz.de [mailto:basex-talk-boun...@mailman.uni-konstanz.de] On Behalf Of Graydon Saunders
Sent: 18 September 2017 14:01
To: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Basex Inner Workings

Sorry for the fumble-fingers; let me try that again.

Have you tried creating literal elements?

Computed elements have overhead; it's presumptively akin to why you don't want to create untyped variables in XSLT 2.0 and 3.0 (an untyped variable might be anything and needs a whole document node to exist in, and this is expensive). In this case, I'd be darkly suspicious the computed elements are computing their contents every time.

I'd be trying

for ...
let $elem1 as element() := ...
let $elem2 as element() := ...
{$elem1,$elem2}

instead of the computed element.

The optimizer is really good in BaseX but it's also really complicated; the local maxima can be quite narrow.
Re: [basex-talk] Basex Inner Workings
Sorry for the fumble-fingers; let me try that again.

Have you tried creating literal elements?

Computed elements have overhead; it's presumptively akin to why you don't want to create untyped variables in XSLT 2.0 and 3.0 (an untyped variable might be anything and needs a whole document node to exist in, and this is expensive). In this case, I'd be darkly suspicious the computed elements are computing their contents every time.

I'd be trying

for ...
let $elem1 as element() := ...
let $elem2 as element() := ...
{$elem1,$elem2}

instead of the computed element.

The optimizer is really good in BaseX but it's also really complicated; the local maxima can be quite narrow.

On Mon, Sep 18, 2017 at 8:58 AM, Graydon Saunders <graydon...@gmail.com> wrote:

> Have you tried creating literal elements?
>
> Computed elements have overhead; it's presumptively akin to why you don't
> want to create untyped variables in XSLT 2.0 and 3.0 (an untyped variable
> might be anything and needs a whole document node to exist in, and this is
> expensive). In this case, I'd be darkly suspicious the computed elements
> are computing their contents every time.
>
> I'd be trying
> for ...
> let $elem1 as element() := ...
> let $elem2 as element() := ...
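Graydon's suggestion above (typed `let` bindings plus a literal, i.e. direct, element constructor instead of a computed one) could be sketched like this. The `item`/`items` names follow the thread; the input structure and data are assumptions for illustration only.

```xquery
(: Hypothetical stand-in for the real database; input names are assumptions. :)
let $db :=
  <records>
    <record><id>1</id><parts><part>a</part><part>b</part></parts></record>
    <record><id>2</id><parts><part>c</part></parts></record>
  </records>
return
  <items>{
    for $r in $db/record
    for $p in $r/parts/part
    (: Typed bindings: each variable must hold exactly one element,
       otherwise the query raises a type error. :)
    let $elem1 as element() := $r/id
    let $elem2 as element() := $p
    (: Direct (literal) constructor in place of element item { ... }. :)
    return <item>{ $elem1, $elem2 }</item>
  }</items>
```

Whether this changes the timings depends on the query and the BaseX version; later in the thread Athanasios reports that it made no difference in his case.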
Re: [basex-talk] Basex Inner Workings
Have you tried creating literal elements?

Computed elements have overhead; it's presumptively akin to why you don't want to create untyped variables in XSLT 2.0 and 3.0 (an untyped variable might be anything and needs a whole document node to exist in, and this is expensive). In this case, I'd be darkly suspicious the computed elements are computing their contents every time.

I'd be trying

for ...
let $elem1 as element() := ...
let $elem2 as element() := ...

On Mon, Sep 18, 2017 at 8:46 AM, Anastasiou A. <a.anastas...@swansea.ac.uk> wrote:

> Hello
>
> Many thanks, Dirk, Fabrice and Graydon.
>
> I was going to look up ways of enabling the server to run as fast as
> possible anyway later on, so it is always good to know how BaseX is
> “thinking”.
>
> I can see what you mean Graydon. This is a simple nested `for` to
> denormalise some of the structures of the XML file, where “some” is defined
> by an XPath expression.
>
> As far as I can tell, there is nothing being re-evaluated repeatedly
> within the inner loop that could be brought outside.
>
> I have gone through the dot plans of the quickest and slowest versions of
> the query and the only thing they differ in is the addition of the CElems.
>
> The “scaling” of the timings, in case it helps, is as follows:
>
> Simple query, returning elements: 1100-1500 ms
>
> Adding an `element` to what is returned just by the innermost `for`:
> 7500-9311 ms
> This means:
> For…
>   For…
>     Return element item{someElement|someOtherElement}
>
> Adding an `element` to the whole block (no `element` to the innermost
> `for`): 49000-67000 ms
> This means:
> Element Items{
>   For…
>     For…
>       Return someElement|someOtherElement
> }
>
> Adding an `element` to both places: 5-8ms
> This means:
> Element Items{
>   For…
>     For…
>       Return element Item {someElement|someOtherElement}
> }
>
> I don’t mind the ~8 sec time, but when we get to 1.5 min, then yes… that’s
> going to be a bit annoying.
>
> All the best
Re: [basex-talk] Basex Inner Workings
Hello

Many thanks, Dirk, Fabrice and Graydon.

I was going to look up ways of enabling the server to run as fast as possible anyway later on, so it is always good to know how BaseX is “thinking”.

I can see what you mean Graydon. This is a simple nested `for` to denormalise some of the structures of the XML file, where “some” is defined by an XPath expression.

As far as I can tell, there is nothing being re-evaluated repeatedly within the inner loop that could be brought outside.

I have gone through the dot plans of the quickest and slowest versions of the query and the only thing they differ in is the addition of the CElems.

The “scaling” of the timings, in case it helps, is as follows:

Simple query, returning elements: 1100-1500 ms

Adding an `element` to what is returned just by the innermost `for`: 7500-9311 ms
This means:
For…
  For…
    Return element item{someElement|someOtherElement}

Adding an `element` to the whole block (no `element` to the innermost `for`): 49000-67000 ms
This means:
Element Items{
  For…
    For…
      Return someElement|someOtherElement
}

Adding an `element` to both places: 5-8ms
This means:
Element Items{
  For…
    For…
      Return element Item {someElement|someOtherElement}
}

I don’t mind the ~8 sec time, but when we get to 1.5 min, then yes… that’s going to be a bit annoying.

All the best

From: Graydon Saunders [mailto:graydon...@gmail.com]
Sent: 15 September 2017 17:04
To: Anastasiou A.; basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Basex Inner Workings

As a follow-on to Dirk, it's amazing how much of a performance difference it can make to use typed variables when you're constructing something for output. (So far as I can tell, variable declarations function as an "optimize this!" flag for BaseX.)

If you get good performance when you're just throwing the resulting nodes and lose it massively by adding structure, as you relate up there somewhere:

The change was to go from simply returning the nodes themselves with a `return thisnode | thatnode |theothernode` to a "formatted" document that has an outer element with a number of `return {thisNode|thatNode|theOtherNode}` inside it.

my immediate thought was "it's querying the same thing multiple times".

In most programming languages it's good practice to not create variables when you can inline. XQuery does not appear to be one of those languages. :) I try to think of this as "how can I make things easy for the optimizer?"

-- Graydon
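For reference, the two intermediate shapes from the timing list in Athanasios's message (wrapper only on the innermost `for`, and wrapper only around the whole block) can be written out as below. The input structure and data are assumptions made up for this sketch; only the `item`/`items` wrappers come from the thread.

```xquery
(: Hypothetical stand-in for the real database; input names are assumptions. :)
let $db :=
  <records>
    <record><id>1</id><parts><part>a</part><part>b</part></parts></record>
    <record><id>2</id><parts><part>c</part></parts></record>
  </records>
return (
  (: Inner wrapper only: a sequence of item elements without a common root. :)
  for $r in $db/record
  for $p in $r/parts/part
  return element item { $r/id | $p },

  (: Outer wrapper only: a single items root holding the flat node sequence. :)
  element items {
    for $r in $db/record
    for $p in $r/parts/part
    return ($r/id | $p)
  }
)
```

In both shapes the element constructor copies the selected nodes into a newly built element, which the plain query returning existing nodes does not have to do. That copying is a plausible contributor to the extra time, though the thread does not establish how much of the difference it accounts for.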
Re: [basex-talk] Basex Inner Workings
As a follow-on to Dirk: it's amazing how much of a performance difference it can make to use typed variables when you're constructing something for output. (So far as I can tell, variable declarations function as an "optimize this!" flag for BaseX.)

If you get good performance when you're just returning the resulting nodes and lose it massively by adding structure, as you describe further up the thread:

*The change was to go from simply returning the nodes themselves with a `return thisnode | thatnode | theothernode` to a "formatted" document that has an outer element with a number of `return {thisNode|thatNode|theOtherNode}` inside it.*

my immediate thought was "it's querying the same thing multiple times".

In most programming languages it's good practice not to create variables when you can inline. XQuery does not appear to be one of those languages. :) I try to think of this as "how can I make things easy for the optimizer?"

-- Graydon

On Fri, Sep 15, 2017 at 11:55 AM, Kirsten, Dirk <dirk.kirs...@senacor.com> wrote:
> Hello Athanasios,
>
> I think you should really check the actual query plan which is executed.
> If you have such a huge change in performance, surely the processor will be
> executing it differently. I don't think looking into the file access patterns
> BaseX uses internally is very useful for an end user. You should let BaseX
> handle that (but of course, if you find better/more efficient ways I am
> sure Christian gladly accepts pull requests). But the pattern you describe
> sounds very much expected: reads when you open databases seem logical, and
> short write operations are also expected when just reading a database,
> because e.g. BaseX has to lock the databases.
>
> So I think it would be more useful to look into the query plan. Of course
> you are more than welcome to ask on this list about what is going on there.
> I would expect that because of your rewrite some indexes are maybe not
> applied anymore (or, if your rewritten result is simply very big, that most
> of the time is spent serializing the data).
>
> Cheers
> Dirk
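A minimal sketch of what Graydon's advice about typed variables could look like in practice. The sample data and all element names (`records`, `record`, `id`, `name`, `tag`) are invented for illustration and are not taken from the thread; whether this form is actually faster depends on the BaseX version and on the real query.

```xquery
(: Hypothetical sample data; element names are invented for illustration. :)
declare variable $doc as element(records) :=
  <records>
    <record><id>1</id><name>a</name><tag>x</tag><tag>y</tag></record>
    <record><id>2</id><name>b</name><tag>z</tag></record>
  </records>;

element items {
  for $r as element(record) in $doc/record
  (: Evaluated once per record, outside the inner loop, with a declared type. :)
  let $head as element()* := ($r/id | $r/name)
  for $t as element(tag) in $r/tag
  (: Each item repeats the record-level nodes next to one tag. :)
  return element item { $head, $t }
}
```

The point of the sketch is only the shape: anything that does not depend on the inner loop is bound to a typed variable before the inner `for`, so the optimizer is told both what the value is and that it need not be recomputed per inner iteration.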
Re: [basex-talk] Basex Inner Workings
Hello Athanasios,

I think you should really check the actual query plan which is executed. If you have such a huge change in performance, surely the processor will be executing it differently. I don't think looking into the file access patterns BaseX uses internally is very useful for an end user. You should let BaseX handle that (but of course, if you find better/more efficient ways I am sure Christian gladly accepts pull requests). But the pattern you describe sounds very much expected: reads when you open databases seem logical, and short write operations are also expected when just reading a database, because e.g. BaseX has to lock the databases.

So I think it would be more useful to look into the query plan. Of course you are more than welcome to ask on this list about what is going on there. I would expect that because of your rewrite some indexes are maybe not applied anymore (or, if your rewritten result is simply very big, that most of the time is spent serializing the data).

Cheers
Dirk

Senacor Technologies Aktiengesellschaft - Registered office: Eschborn - District Court Frankfurt am Main - Reg. No.: HRB 105546
Executive Board: Matthias Tomann, Marcus Purzer - Chairman of the Supervisory Board: Daniel Grözinger

-Original Message-
From: basex-talk-boun...@mailman.uni-konstanz.de [mailto:basex-talk-boun...@mailman.uni-konstanz.de] On Behalf Of Fabrice ETANCHAUD
Sent: Friday, 15 September 2017 17:35
To: 'Anastasiou A.' <a.anastas...@swansea.ac.uk>; basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Basex Inner Workings

You can find the time spent in each step in the query info bar graph.

If you are looking for the schema and the facets of your dataset, you should have a look at the index module, and for sure at index:facets().

Best regards,
Fabrice
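One way to follow Dirk's suggestion from inside XQuery itself is sketched below. It assumes a BaseX version that provides the `xquery:parse` function of the XQuery Module with the `compile` and `plan` options; the query string is a made-up toy, not the query discussed in this thread.

```xquery
(: Toy query string; replace it with the query under investigation. :)
let $query := "element items { for $i in 1 to 3 return element item { $i } }"
(: 'compile' asks for the plan after compilation/optimization,
   'plan' asks for the plan to be returned as XML. :)
return xquery:parse($query, map { 'compile': true(), 'plan': true() })
```

Running this for the fast and the slow variant of a query and comparing the two plans shows which expressions were added or rewritten. The names of the expressions in the plan are BaseX internals and may differ between versions.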
Re: [basex-talk] Basex Inner Workings
You can find the time spent in each step in the query info bar graph.

If you are looking for the schema and the facets of your dataset, you should have a look at the index module, and for sure at index:facets().

Best regards,
Fabrice

-Original Message-
From: Anastasiou A. [mailto:a.anastas...@swansea.ac.uk]
Sent: Friday, 15 September 2017 17:23
To: Fabrice ETANCHAUD; basex-talk@mailman.uni-konstanz.de
Subject: RE: Basex Inner Workings

Thank you Fabrice. I understand.

I have not tried querying from the command prompt or sending the output directly to a file, which I could also work with. But my understanding is that the time quoted by the GUI is the DB time, not taking into account the time it takes for the list to be pushed into whatever data structures the list boxes might be using (?).

I am trying to get a better understanding of the dataset at the moment, and I have short and long queries which, depending on the results I get from this step, could be optimised further.

All the best
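For reference, a short sketch of the calls Fabrice points to. The database name `testdb` is hypothetical, and the query only works if a database of that name exists.

```xquery
(: 'testdb' is a hypothetical database name; substitute your own. :)
let $db := 'testdb'
return (
  (: General database information, including which indexes exist. :)
  db:info($db),
  (: Element/attribute names with value statistics, in document structure. :)
  index:facets($db),
  (: The same statistics as a flat list instead of a tree. :)
  index:facets($db, 'flat')
)
```

`index:facets` reports on the names and value distributions found in the database, which helps when exploring an unfamiliar dataset; it does not by itself say whether a given query uses an index.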
Re: [basex-talk] Basex Inner Workings
Thank you Fabrice. I understand.

I have not tried querying from the command prompt or sending the output directly to a file, which I could also work with. But my understanding is that the time quoted by the GUI is the DB time, not taking into account the time it takes for the list to be pushed into whatever data structures the list boxes might be using (?).

I am trying to get a better understanding of the dataset at the moment, and I have short and long queries which, depending on the results I get from this step, could be optimised further.

All the best

-Original Message-
From: Fabrice ETANCHAUD [mailto:fetanch...@pch.cerfrance.fr]
Sent: 15 September 2017 16:17
To: Anastasiou A.; basex-talk@mailman.uni-konstanz.de
Subject: RE: Basex Inner Workings

I understand that you are reformatting a lot of data, aren't you?
I only have a little advice, because this is not my use case.

From what I know, the resulting document will be materialized entirely in memory before presentation or export.
You should export your results to disk, in order not to lose time in BaseXGUI rendering.

To reformat very large amounts of data, you might have a look at Saxon's streaming features (not in the free version).

But usually, big results are not requested frequently.

Best regards,
Fabrice
Re: [basex-talk] Basex Inner Workings
I understand that you are reformatting a lot of data, aren't you?
I only have a little advice, because this is not my use case.

From what I know, the resulting document will be materialized entirely in memory before presentation or export.
You should export your results to disk, in order not to lose time in BaseXGUI rendering.

To reformat very large amounts of data, you might have a look at Saxon's streaming features (not in the free version).

But usually, big results are not requested frequently.

Best regards,
Fabrice

-Original Message-
From: Anastasiou A. [mailto:a.anastas...@swansea.ac.uk]
Sent: Friday, 15 September 2017 16:39
To: Fabrice ETANCHAUD; basex-talk@mailman.uni-konstanz.de
Subject: RE: Basex Inner Workings

Hello Fabrice,

Yes, I have a query which jumped from ~1500 ms to about a minute with a tiny little change...

The DB is about 2GB and it is my test set before putting the query to work on the full dataset.

The change was to go from simply returning the nodes themselves with a `return thisnode | thatnode | theothernode` to a "formatted" document that has an outer element with a number of `return {thisNode|thatNode|theOtherNode}` inside it. I understand that the new query might be creating some new entities, but compared to the element content, these few extra characters are not THAT many more.

The query jumps from ~1500 ms when using plain XML, to ~55000 ms with the addition of the collection and item nodes, to ~57000 ms with the addition of CSV exporting via the CSV module.

These are "informal average" values: I have not run the same query several times and averaged the results, but that is the vicinity of the numbers I have seen in the runs so far.

The database itself is "static": there are no update/insert transactions at the moment; the only thing that I am trying to do is extract some data from it in a different format.

I have Text, Attribute and Token indexes on that database (optimised right after importing) but no further options enabled. I also have not experimented with SPLITSIZE (?). I have 32GB of memory and it should be enough to handle this 2GB test dataset (?).

I will have a go with DEBUG on.

Did you have to enable any additional options for indexes to work faster?

All the best
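Fabrice's advice to export to disk rather than render in the GUI can be sketched with the File Module. The result below is a small stand-in, and the output path `items.xml` is a hypothetical relative path, so the file ends up wherever BaseX resolves relative paths on your system.

```xquery
(: Stand-in for a large constructed result. :)
let $result := element items {
  for $i in 1 to 5
  return element item { $i }
}
(: Hypothetical output path; adjust to a directory you can write to. :)
let $path := 'items.xml'
(: Serializes $result as XML and writes it to the file; returns nothing. :)
return file:write($path, $result)
```

Writing the result to a file takes the GUI's result view out of the measurement, which makes it easier to tell query time apart from display time. It does not change how the result itself is built in memory.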
Re: [basex-talk] Basex Inner Workings
Hello Fabrice,

Yes, I have a query which jumped from ~1500 ms to about a minute with a tiny little change...

The DB is about 2GB and it is my test set before putting the query to work on the full dataset.

The change was to go from simply returning the nodes themselves with a `return thisnode | thatnode | theothernode` to a "formatted" document that has an outer element with a number of `return {thisNode|thatNode|theOtherNode}` inside it. I understand that the new query might be creating some new entities, but compared to the element content, these few extra characters are not THAT many more.

The query jumps from ~1500 ms when using plain XML, to ~55000 ms with the addition of the collection and item nodes, to ~57000 ms with the addition of CSV exporting via the CSV module.

These are "informal average" values: I have not run the same query several times and averaged the results, but that is the vicinity of the numbers I have seen in the runs so far.

The database itself is "static": there are no update/insert transactions at the moment; the only thing that I am trying to do is extract some data from it in a different format.

I have Text, Attribute and Token indexes on that database (optimised right after importing) but no further options enabled. I also have not experimented with SPLITSIZE (?). I have 32GB of memory and it should be enough to handle this 2GB test dataset (?).

I will have a go with DEBUG on.

Did you have to enable any additional options for indexes to work faster?

All the best

-Original Message-
From: basex-talk-boun...@mailman.uni-konstanz.de [mailto:basex-talk-boun...@mailman.uni-konstanz.de] On Behalf Of Fabrice ETANCHAUD
Sent: 15 September 2017 13:27
To: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Basex Inner Workings

Hi Athanasios,

Did you experience slow queries?
Are you sure you are using all the index features?
Are these queries operational ones (direct access on a key value) or analytical ones?

I have never experienced slow queries, even on a huge XML corpus (patent registrations), but this comes at the cost of longer indexing times on updates.

Best regards,
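The two query shapes described in this message can be sketched as follows. The data and element names are invented, since the real query is not shown in the thread; only the contrast between returning existing nodes and wrapping them in constructed elements is the point.

```xquery
(: Hypothetical stand-in for the real database; element names are invented. :)
declare variable $doc :=
  <records>
    <record><id>1</id><name>a</name><tag>x</tag><tag>y</tag></record>
    <record><id>2</id><name>b</name><tag>z</tag></record>
  </records>;

(: Shape 1: nested for, returning the existing nodes as they are. :)
declare function local:plain() as node()* {
  for $r in $doc/record
  for $t in $r/tag
  return ($r/id | $r/name | $t)
};

(: Shape 2: the same loops, but every tuple is placed in a new item
   element and all items are wrapped in one items element. :)
declare function local:wrapped() as element(items) {
  element items {
    for $r in $doc/record
    for $t in $r/tag
    return element item { $r/id | $r/name | $t }
  }
};

local:plain(), local:wrapped()
```

In XQuery, nodes placed inside an element constructor are copied into the new element, so shape 2 builds a new tree while shape 1 only hands back references to nodes that already exist. That is one plausible contributor to the difference in run time, but the thread does not confirm it as the cause here.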
Re: [basex-talk] Basex Inner Workings
Hello Alexander,

The thesis is a fantastic resource for getting to know a bit more about BaseX's inner workings, thank you very much.

I had seen the Storage_Layout page already, but I was trying to understand if there is anything that can be done at the file system level. This was also because I read that parallel operations could result in access patterns that cannot be handled efficiently by caching (which is a very good point anyway).

All the best

-Original Message-
From: Alexander Holupirek [mailto:a...@holupirek.de]
Sent: 15 September 2017 13:56
To: Anastasiou A.
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Basex Inner Workings

> On 15. Sep 2017, at 14:00, Anastasiou A. <a.anastas...@swansea.ac.uk> wrote:
> Quick question: Is there any document / URL where I could find out more about
> how does Basex access the disk during its operation?
>
> For example, are there any reads to be expected during executing a query?

You can have a look at Christian's dissertation:
http://files.basex.org/publications/Gruen%20[2010],%20Storing%20and%20Querying%20Large%20XML%20Instances.pdf

That way you can at least get a picture of the inner organisation of the storage system and may deduce some access patterns?

http://docs.basex.org/wiki/Storage_Layout may help as well?
Re: [basex-talk] Basex Inner Workings
> On 15. Sep 2017, at 14:00, Anastasiou A. wrote:
> Quick question: Is there any document / URL where I could find out more about
> how does Basex access the disk during its operation?
>
> For example, are there any reads to be expected during executing a query?

You can have a look at Christian's dissertation:
http://files.basex.org/publications/Gruen%20[2010],%20Storing%20and%20Querying%20Large%20XML%20Instances.pdf

That way you can at least get a picture of the inner organisation of the storage system and may deduce some access patterns?

http://docs.basex.org/wiki/Storage_Layout may help as well?
Re: [basex-talk] Basex Inner Workings
Hi Athanasios,

Did you experience slow queries?
Are you sure you are using all the index features?
Are these queries operational ones (direct access on a key value) or analytical ones?

I have never experienced slow queries, even on a huge XML corpus (patent registrations), but this comes at the cost of longer indexing times on updates.

Best regards,

-Original Message-
From: basex-talk-boun...@mailman.uni-konstanz.de [mailto:basex-talk-boun...@mailman.uni-konstanz.de] On Behalf Of Anastasiou A.
Sent: Friday, 15 September 2017 14:01
To: basex-talk@mailman.uni-konstanz.de
Subject: [basex-talk] Basex Inner Workings

Hello everyone,

Quick question: Is there any document / URL where I could find out more about how BaseX accesses the disk during its operation?

For example, are there any reads to be expected while executing a query?

Through iotop, I can see 3-4 processes reading during startup, then another 2 firing very briefly when opening the database, and then during querying there are periodic writes (?) but of very brief duration.

I was wondering if there is anything that could be done from the point of view of the hardware to speed up queries (?) (except a more powerful machine at the moment).

All the best
Athanasios Anastasiou
[basex-talk] Basex Inner Workings
Hello everyone,

Quick question: Is there any document / URL where I could find out more about how BaseX accesses the disk during its operation?

For example, are there any reads to be expected while executing a query?

Through iotop, I can see 3-4 processes reading during startup, then another 2 firing very briefly when opening the database, and then during querying there are periodic writes (?) but of very brief duration.

I was wondering if there is anything that could be done from the point of view of the hardware to speed up queries (?) (except a more powerful machine at the moment).

All the best
Athanasios Anastasiou