Re: [basex-talk] improving query performance

2020-08-23 Thread Liam R. E. Quin
On Sun, 2020-08-23 at 14:05 -0700, Bill Osmond wrote:
> Indeed I have, with no positive results unfortunately. I'm now
> testing to
> see if having multiple return statements (as in Liam's queries)
> helps,
> although the results so far are basically the same.

I tried to make clear what was going on, using explit returns, The
syntax of a FLWOR expression allows either form:
  for $sock in /drawer/socks, $shoe in /tray/shoes
is the same as
  for $sock in /drawer/socks
for $shoe in /tray/shoes
and constructs every possible (sock, shoe) pair.

However,
  for $sock in /drawer/socks
  return
for $shoe in /tray/shoes
return
   some_expression($sock, $shoe)
maybe makes clearer that some_exoression() will be called
count(/drawer/socks) * count(/tray/shoes) times.

the way to speed this up is likely to construct many fewer tuples,
using grouping or windowing to process the inner part of your query.

Liam

-- 
Liam Quin, https://www.delightfulcomputing.com/
Available for XML/Document/Information Architecture/XSLT/
XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
Barefoot Web-slave, antique illustrations:  http://www.fromoldbooks.org



Re: [basex-talk] improving query performance

2020-08-23 Thread Bill Osmond
Indeed I have, with no positive results unfortunately. I'm now testing to
see if having multiple return statements (as in Liam's queries) helps,
although the results so far are basically the same.

On Sat, Aug 22, 2020 at 9:59 AM Christian Grün 
wrote:

> Yes, I see now why my query returns much more hits than yours
> (including the first). As Liam already expressed, it’s not really a
> nested query what you are wanting to achieve.
>
> Oe thing you can always try is to change the order of your for clauses
> and see what happens. Maybe you did that already? In all cases, no
> index will be applied at the moments with these kinds of query
> patterns. I might get back to you later or tomorrow once I have
> another idea what could be done.
>
>
> On Sat, Aug 22, 2020 at 6:42 PM Bill Osmond  wrote:
> >
> > This is vexing - it seems as though the mechanism that provides the
> necessary "filtering" is the very thing that slows the execution down so
> much. This wouldn't have been obvious from the single example document I
> sent earlier, but each document stands alone: all of the searching and
> reference linking done for each TrackRelease in a NewReleaseMessage should
> only refer to other nodes in that same NewReleaseMessage.
> >
> > In my query, I started out with "for $r in /ernm:NewReleaseMessage" and
> I used $r on the right hand side of the subsequent for statements. It seems
> like without that, the execution is quick, but all the results from every
> document are getting matched to each other. With it, the results are
> correct, but the execution time shoots way up. In case any of you still
> have any patience for this question (and thanks again for everything so
> far!), I've attached a small sample set of 6 documents. The desired number
> of results from the query is 70 (which is the number of TrackReleases from
> all the documents combined), and the query that I've adapted from
> Christian's ddex2.xq which returns the right number of results is the
> following:
> >
> > declare namespace ernm = 'http://ddex.net/xml/ern/411';
> > (: declare context item := db:open('ddex'); :)
> >
> > for $r in /ernm:NewReleaseMessage
> >
> > for $party in $r/PartyList/Party[
> >   PartyReference/text() =
> >   $r/ReleaseList/TrackRelease/ReleaseLabelReference
> > ]
> > for $track_release in $r/ReleaseList/TrackRelease[
> >   ReleaseLabelReference/text() =
> >   $r/PartyList/Party/PartyReference
> > ]
> > for $sound_recording in $r/ResourceList/SoundRecording[
> >   ResourceReference/text() =
> >   $track_release/ReleaseResourceReference
> > ]
> > for $release in $r/ReleaseList/Release[
> >
>  
> ResourceGroup/ResourceGroup/ResourceGroupContentItem/ReleaseResourceReference/text()
> =
> >   $track_release/ReleaseResourceReference
> > ]
> > return 
> >   { $track_release/ReleaseId/ISRC/text() }
> >   { fn:string-join($sound_recording/DisplayArtistName, '/')
> }
> >   { $sound_recording/DisplayTitleText/text() }
> >   { $release/DisplayTitleText/text() }
> >   { $release/ReleaseId/ICPN/text() }
> >   { $party/PartyName/FullName/text() }
> > 
>


-- 



*William Osmond*
Manager, System Operations & IT

3760 Motor Ave, Los Angeles, CA 90034
310-853-8737 | www.with.in


Re: [basex-talk] improving query performance

2020-08-22 Thread Christian Grün
Yes, I see now why my query returns much more hits than yours
(including the first). As Liam already expressed, it’s not really a
nested query what you are wanting to achieve.

Oe thing you can always try is to change the order of your for clauses
and see what happens. Maybe you did that already? In all cases, no
index will be applied at the moments with these kinds of query
patterns. I might get back to you later or tomorrow once I have
another idea what could be done.


On Sat, Aug 22, 2020 at 6:42 PM Bill Osmond  wrote:
>
> This is vexing - it seems as though the mechanism that provides the necessary 
> "filtering" is the very thing that slows the execution down so much. This 
> wouldn't have been obvious from the single example document I sent earlier, 
> but each document stands alone: all of the searching and reference linking 
> done for each TrackRelease in a NewReleaseMessage should only refer to other 
> nodes in that same NewReleaseMessage.
>
> In my query, I started out with "for $r in /ernm:NewReleaseMessage" and I 
> used $r on the right hand side of the subsequent for statements. It seems 
> like without that, the execution is quick, but all the results from every 
> document are getting matched to each other. With it, the results are correct, 
> but the execution time shoots way up. In case any of you still have any 
> patience for this question (and thanks again for everything so far!), I've 
> attached a small sample set of 6 documents. The desired number of results 
> from the query is 70 (which is the number of TrackReleases from all the 
> documents combined), and the query that I've adapted from Christian's 
> ddex2.xq which returns the right number of results is the following:
>
> declare namespace ernm = 'http://ddex.net/xml/ern/411';
> (: declare context item := db:open('ddex'); :)
>
> for $r in /ernm:NewReleaseMessage
>
> for $party in $r/PartyList/Party[
>   PartyReference/text() =
>   $r/ReleaseList/TrackRelease/ReleaseLabelReference
> ]
> for $track_release in $r/ReleaseList/TrackRelease[
>   ReleaseLabelReference/text() =
>   $r/PartyList/Party/PartyReference
> ]
> for $sound_recording in $r/ResourceList/SoundRecording[
>   ResourceReference/text() =
>   $track_release/ReleaseResourceReference
> ]
> for $release in $r/ReleaseList/Release[
>   
> ResourceGroup/ResourceGroup/ResourceGroupContentItem/ReleaseResourceReference/text()
>  =
>   $track_release/ReleaseResourceReference
> ]
> return 
>   { $track_release/ReleaseId/ISRC/text() }
>   { fn:string-join($sound_recording/DisplayArtistName, '/') }
>   { $sound_recording/DisplayTitleText/text() }
>   { $release/DisplayTitleText/text() }
>   { $release/ReleaseId/ICPN/text() }
>   { $party/PartyName/FullName/text() }
> 


Re: [basex-talk] improving query performance

2020-08-22 Thread Christian Grün
That's good to hear. My rewritten query was based on the query of your
first post, and I already guessed that all the nested loops are not really
wanted or required.

Looking forward to learning about your next insights,
Christian



Bill Osmond  schrieb am Sa., 22. Aug. 2020, 16:31:

> Great e-mail messages to wake up to! Thank you for the further explanation
> Liam, and Christian the examples you provided were considerably faster:
>
> - my fastest was 70k ms
> - your ddex.xq was 35kms
> - your ddex2.xq was 10kms!
>
> There is only one issue: both ddex.xq and ddex2.xq seem to return many
> more results than expected (cartesian product somewhere perhaps)
>
> When I run the queries against a smaller database - one with just 6 of the
> DDEX documents, my query returns 70 results which matches the number of
> TrackReleases, but both ddex.xq and ddex2.xq return 303,134 results. It
> looks like a separate "copy" of the output is being created for every Party
> in the PartyList, when really there should be only one (specified by the
> PartyReference). But this is very promising - if it takes 10 seconds to
> return a massively expanded version of the data, then perhaps this will get
> to <1000ms!
>
> On Sat, Aug 22, 2020 at 4:07 AM Christian Grün 
> wrote:
>
>> Hi Bill,
>>
>> Feel free to run the attached queries; maybe they give you a faster
>> result.
>>
>> Your use case was interesting. It gave me some additional ideas on how
>> to speed up queries (by reordering consecutive 'for' clauses that do
>> not change the result).
>>
>> Cheers,
>> Christian
>>
>>
>> On Sat, Aug 22, 2020 at 6:10 AM Liam R. E. Quin 
>> wrote:
>> >
>> > On Fri, 2020-08-21 at 17:28 -0700, Bill Osmond wrote:
>> > > I'm beginning to think that perhaps my performance hopes were a bit
>> > > too
>> > > inflated, given the size and complexity of our database. After a
>> > > fresh
>> > > optimization, and with -Xms2g -Xmx10g, the following query takes
>> > > 1492ms:
>> >
>> > [...]
>> >
>> > First note - there are in fact no loops in your query. Although "for"
>> > is used to introduce a loop in many procedural languages, it does nto
>> > do so in XQuery (nor does for-each in XSLT).
>> >
>> > In fact, it's closer to what SQL people know as a join.
>> >
>> > It's making a stream of n-tuples, and then evaluating the inner
>> > expression for each tuple, so that
>> >
>> > for $a in (  'a', 'b', 'c')
>> >   for $b in (1 to 5)
>> > return $a || '-' || $b
>> >
>> > produces 15 lines of output,
>> > a-1, a-2, 1-3, a-4, a-6, b-1, and so on.
>> >
>> > You can see the BaseX query plan for your query already moves your
>> > where clauses as i did by hand, because BaseX is awesome.
>> >
>> > To make the query fast, you either need to reduce the number of tuples,
>> > and henve the number of times the expressions are evaluated, or you
>> > need to reduce the cost of creating the tuples.
>> >
>> > Moving the where clauses was my attempt to reduce the number of tuples.
>> > Adding an index might reduce the cost of making the tuples, so i'd
>> > certainly try that.
>> >
>> > If the input document is sorted, you might be able to construct
>> > something recursively (e.g. with fold-left) or use grouping or
>> > windowing to process $parties in groups, which may help considerably.
>> >
>> > Without seeing the data, that's only a guess.
>> >
>> > Liam
>> >
>> > --
>> > Liam Quin, https://www.delightfulcomputing.com/
>> > Available for XML/Document/Information Architecture/XSLT/
>> > XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
>> > Barefoot Web-slave, antique illustrations:  http://www.fromoldbooks.org
>> >
>>
>
>


Re: [basex-talk] improving query performance

2020-08-22 Thread Bill Osmond
Great e-mail messages to wake up to! Thank you for the further explanation
Liam, and Christian the examples you provided were considerably faster:

- my fastest was 70k ms
- your ddex.xq was 35kms
- your ddex2.xq was 10kms!

There is only one issue: both ddex.xq and ddex2.xq seem to return many more
results than expected (cartesian product somewhere perhaps)

When I run the queries against a smaller database - one with just 6 of the
DDEX documents, my query returns 70 results which matches the number of
TrackReleases, but both ddex.xq and ddex2.xq return 303,134 results. It
looks like a separate "copy" of the output is being created for every Party
in the PartyList, when really there should be only one (specified by the
PartyReference). But this is very promising - if it takes 10 seconds to
return a massively expanded version of the data, then perhaps this will get
to <1000ms!

On Sat, Aug 22, 2020 at 4:07 AM Christian Grün 
wrote:

> Hi Bill,
>
> Feel free to run the attached queries; maybe they give you a faster result.
>
> Your use case was interesting. It gave me some additional ideas on how
> to speed up queries (by reordering consecutive 'for' clauses that do
> not change the result).
>
> Cheers,
> Christian
>
>
> On Sat, Aug 22, 2020 at 6:10 AM Liam R. E. Quin 
> wrote:
> >
> > On Fri, 2020-08-21 at 17:28 -0700, Bill Osmond wrote:
> > > I'm beginning to think that perhaps my performance hopes were a bit
> > > too
> > > inflated, given the size and complexity of our database. After a
> > > fresh
> > > optimization, and with -Xms2g -Xmx10g, the following query takes
> > > 1492ms:
> >
> > [...]
> >
> > First note - there are in fact no loops in your query. Although "for"
> > is used to introduce a loop in many procedural languages, it does nto
> > do so in XQuery (nor does for-each in XSLT).
> >
> > In fact, it's closer to what SQL people know as a join.
> >
> > It's making a stream of n-tuples, and then evaluating the inner
> > expression for each tuple, so that
> >
> > for $a in (  'a', 'b', 'c')
> >   for $b in (1 to 5)
> > return $a || '-' || $b
> >
> > produces 15 lines of output,
> > a-1, a-2, 1-3, a-4, a-6, b-1, and so on.
> >
> > You can see the BaseX query plan for your query already moves your
> > where clauses as i did by hand, because BaseX is awesome.
> >
> > To make the query fast, you either need to reduce the number of tuples,
> > and henve the number of times the expressions are evaluated, or you
> > need to reduce the cost of creating the tuples.
> >
> > Moving the where clauses was my attempt to reduce the number of tuples.
> > Adding an index might reduce the cost of making the tuples, so i'd
> > certainly try that.
> >
> > If the input document is sorted, you might be able to construct
> > something recursively (e.g. with fold-left) or use grouping or
> > windowing to process $parties in groups, which may help considerably.
> >
> > Without seeing the data, that's only a guess.
> >
> > Liam
> >
> > --
> > Liam Quin, https://www.delightfulcomputing.com/
> > Available for XML/Document/Information Architecture/XSLT/
> > XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
> > Barefoot Web-slave, antique illustrations:  http://www.fromoldbooks.org
> >
>


Re: [basex-talk] improving query performance

2020-08-22 Thread Christian Grün
Hi Bill,

Feel free to run the attached queries; maybe they give you a faster result.

Your use case was interesting. It gave me some additional ideas on how
to speed up queries (by reordering consecutive 'for' clauses that do
not change the result).

Cheers,
Christian


On Sat, Aug 22, 2020 at 6:10 AM Liam R. E. Quin  wrote:
>
> On Fri, 2020-08-21 at 17:28 -0700, Bill Osmond wrote:
> > I'm beginning to think that perhaps my performance hopes were a bit
> > too
> > inflated, given the size and complexity of our database. After a
> > fresh
> > optimization, and with -Xms2g -Xmx10g, the following query takes
> > 1492ms:
>
> [...]
>
> First note - there are in fact no loops in your query. Although "for"
> is used to introduce a loop in many procedural languages, it does nto
> do so in XQuery (nor does for-each in XSLT).
>
> In fact, it's closer to what SQL people know as a join.
>
> It's making a stream of n-tuples, and then evaluating the inner
> expression for each tuple, so that
>
> for $a in (  'a', 'b', 'c')
>   for $b in (1 to 5)
> return $a || '-' || $b
>
> produces 15 lines of output,
> a-1, a-2, 1-3, a-4, a-6, b-1, and so on.
>
> You can see the BaseX query plan for your query already moves your
> where clauses as i did by hand, because BaseX is awesome.
>
> To make the query fast, you either need to reduce the number of tuples,
> and henve the number of times the expressions are evaluated, or you
> need to reduce the cost of creating the tuples.
>
> Moving the where clauses was my attempt to reduce the number of tuples.
> Adding an index might reduce the cost of making the tuples, so i'd
> certainly try that.
>
> If the input document is sorted, you might be able to construct
> something recursively (e.g. with fold-left) or use grouping or
> windowing to process $parties in groups, which may help considerably.
>
> Without seeing the data, that's only a guess.
>
> Liam
>
> --
> Liam Quin, https://www.delightfulcomputing.com/
> Available for XML/Document/Information Architecture/XSLT/
> XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
> Barefoot Web-slave, antique illustrations:  http://www.fromoldbooks.org
>


ddex2.xq
Description: Binary data


ddex.xq
Description: Binary data


Re: [basex-talk] improving query performance

2020-08-21 Thread Liam R. E. Quin
On Fri, 2020-08-21 at 17:28 -0700, Bill Osmond wrote:
> I'm beginning to think that perhaps my performance hopes were a bit
> too
> inflated, given the size and complexity of our database. After a
> fresh
> optimization, and with -Xms2g -Xmx10g, the following query takes
> 1492ms:

[...]

First note - there are in fact no loops in your query. Although "for"
is used to introduce a loop in many procedural languages, it does nto
do so in XQuery (nor does for-each in XSLT).

In fact, it's closer to what SQL people know as a join.

It's making a stream of n-tuples, and then evaluating the inner
expression for each tuple, so that

for $a in (  'a', 'b', 'c')
  for $b in (1 to 5)
return $a || '-' || $b

produces 15 lines of output,
a-1, a-2, 1-3, a-4, a-6, b-1, and so on.

You can see the BaseX query plan for your query already moves your
where clauses as i did by hand, because BaseX is awesome.

To make the query fast, you either need to reduce the number of tuples,
and henve the number of times the expressions are evaluated, or you
need to reduce the cost of creating the tuples.

Moving the where clauses was my attempt to reduce the number of tuples.
Adding an index might reduce the cost of making the tuples, so i'd
certainly try that.

If the input document is sorted, you might be able to construct
something recursively (e.g. with fold-left) or use grouping or
windowing to process $parties in groups, which may help considerably.

Without seeing the data, that's only a guess.

Liam

-- 
Liam Quin, https://www.delightfulcomputing.com/
Available for XML/Document/Information Architecture/XSLT/
XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
Barefoot Web-slave, antique illustrations:  http://www.fromoldbooks.org



Re: [basex-talk] improving query performance

2020-08-21 Thread Bill Osmond
I'm beginning to think that perhaps my performance hopes were a bit too
inflated, given the size and complexity of our database. After a fresh
optimization, and with -Xms2g -Xmx10g, the following query takes 1492ms:

declare namespace ernm="http://ddex.net/xml/ern/411;;
for $r in /ernm:NewReleaseMessage
  for $track_release in $r/ReleaseList/TrackRelease
return
  
{ $track_release/ReleaseId/ISRC/text() }
  

When I add a little bit more to the query, we go up to 11204ms:

declare namespace ernm="http://ddex.net/xml/ern/411;;
for $r in /ernm:NewReleaseMessage
  let $parties := $r/PartyList/Party
  for $track_release in $r/ReleaseList/TrackRelease
let $rlr := $track_release/ReleaseLabelReference/text()
let $party := $parties[PartyReference/text() = $rlr]
return
  
{ $track_release/ReleaseId/ISRC/text() }
{ $party/PartyName/FullName/text() }
  

The fastest full query that I've been able to make so far takes 70450ms:

declare namespace ernm="http://ddex.net/xml/ern/411;;
for $r in /ernm:NewReleaseMessage
  let $parties := $r/PartyList/Party
  let $sound_recordings := $r/ResourceList/SoundRecording
  let $releases := $r/ReleaseList/Release
  for $track_release in $r/ReleaseList/TrackRelease
let $rrr := $track_release/ReleaseResourceReference/text()
let $rlr := $track_release/ReleaseLabelReference/text()
let $sound_recording := $sound_recordings[ResourceReference/text() =
$rrr]
let $release :=
$releases[ResourceGroup/ResourceGroup/ResourceGroupContentItem/ReleaseResourceReference/text()
= $rrr]
let $party := $parties[PartyReference/text() = $rlr]
return
  
{ $track_release/ReleaseId/ISRC/text() }
{ fn:string-join($sound_recording/DisplayArtistName/text(),
'/') }
{ $sound_recording/DisplayTitleText/text() }
{ $release/DisplayTitleText/text() }
{ $release/ReleaseId/ICPN/text() }
{ $party/PartyName/FullName/text() }
  

Still, a significant improvement over where I started! Running this query
against the actual remote server, and retrieving the full result set, is
down to 3:52, from ~15 minutes. As this is a batch process that will run
every hour, that performance is adequate. It does seem to be heavily CPU
bound: when running any of these queries, there's always a single core
sitting at 100% utilization while the rest of them are idle.

Thanks for the help!
Bill


Re: [basex-talk] improving query performance

2020-08-21 Thread Christian Grün
Just a quick hint: As the query info output indicates that no text
index is used (see [1]), you could try to attach explicit text() steps
in your comparisons:

where $track_release/ReleaseLabelReference/text() =
$r/PartyList/Party/PartyReference/text()
...
where $track_release/ReleaseResourceReference/text() =
$sound_recording/ResourceReference/text()
...

Feel free to provide us with a minimized XML document that allows us
to run your query.

[1] https://docs.basex.org/wiki/Indexes



On Fri, Aug 21, 2020 at 11:26 PM Bill Osmond  wrote:
>
> The indentation levels and multiple returns in your suggested query alone has 
> helped me in understanding what's actually going on, and I've rewritten the 
> query mostly as you suggest:
>
> for $r in /ernm:NewReleaseMessage
>   for $track_release in $r/ReleaseList/TrackRelease
>   where $track_release/ReleaseLabelReference = 
> $r/PartyList/Party/PartyReference
>   return
> for $party in $r/PartyList/Party
>   for $sound_recording in $r/ResourceList/SoundRecording
>   where $track_release/ReleaseResourceReference = 
> $sound_recording/ResourceReference
>   return
> for $release in $r/ReleaseList/Release
> where $track_release/ReleaseResourceReference = 
> $release/ResourceGroup/ResourceGroup/ResourceGroupContentItem/ReleaseResourceReference
> return
>   
> { $track_release/ReleaseId/ISRC/text() }
> { fn:string-join($sound_recording/DisplayArtistName, '/') 
> }
> { $sound_recording/DisplayTitleText/text() }
> { $release/DisplayTitleText/text() }
> { $release/ReleaseId/ICPN/text() }
> { $party/PartyName/FullName/text() }
>   
>
> I say "mostly" because at the first 'where' statement, I had to replace the 
> $party/PartyReference on the right side of the = with an explicit reference 
> from $r, because $party isn't defined until 2 lines further down. The query 
> does work and appears to be giving the right results, however the performance 
> is basically the same: 127,698.27 ms for this run vs. 114,935.22 ms for the 
> original. I'll try shuffling things around next!
>
> On Fri, Aug 21, 2020 at 1:52 PM Bill Osmond  wrote:
>>
>> Thank you both! I was in the middle of typing a response to Bridger when 
>> this came through, so while I digest both of your messages & test index 
>> addition/moving things around, I'll leave the output from the GUI info panel 
>> here - if I knew how to read it, it would no doubt point straight at the 
>> issue! I'll note that it runs much quicker locally as opposed to on 
>> server+via the python BaseXClient, but I don't know if that's because the 
>> GUI is stopping at 500k results, or what.
>>
>> Compiling:
>> - rewrite context value to document-node() sequence: . -> 
>> (db:open-pre("umg-118061851424", 0), ...)
>> - rewrite util:root(nodes) to document-node() sequence: 
>> util:root((db:open-pre("umg-118061851424", 0), ...)) -> 
>> (db:open-pre("umg-118061851424", 0), ...)
>> - move where clause: $track_release_1/ReleaseLabelReference = 
>> $party_2/PartyReference
>> - swap operands: PartyReference = $track_release_1/ReleaseLabelReference
>> - rewrite to predicate: PartyReference = 
>> $track_release_1/ReleaseLabelReference
>> - move where clause: $track_release_1/ReleaseResourceReference = 
>> $sound_recording_3/ResourceReference
>> - swap operands: ResourceReference = 
>> $track_release_1/ReleaseResourceReference
>> - rewrite to predicate: ResourceReference = 
>> $track_release_1/ReleaseResourceReference
>> - swap operands: 
>> ResourceGroup/ResourceGroup/ResourceGroupContentItem/ReleaseResourceReference
>>  = $track_release_1/Rel...
>> - rewrite to predicate: 
>> ResourceGroup/ResourceGroup/ResourceGroupContentItem/ReleaseResourceReference
>>  = $track_release_1/Rel...
>> - inline for $release_4 in 
>> $r_0/ReleaseList/Release[ResourceGroup/ResourceGroup/ResourceGroupContentItem/Rele...
>>
>> Optimized Query:
>> for $r_0 in (db:open-pre("umg-118061851424", 0), ...)/ernm:NewReleaseMessage 
>> for $track_release_1 in $r_0/ReleaseList/TrackRelease for $party_2 in 
>> $r_0/PartyList/Party[PartyReference = 
>> $track_release_1/ReleaseLabelReference] for $sound_recording_3 in 
>> $r_0/ResourceList/SoundRecording[ResourceReference = 
>> $track_release_1/ReleaseResourceReference] return 
>> $r_0/ReleaseList/Release[ResourceGroup/ResourceGroup/ResourceGroupContentItem/ReleaseResourceReference
>>  = $track_release_1/ReleaseResourceReference] !   { 
>> $track_release_1/ReleaseId/ISRC/text() }   { 
>> string-join($sound_recording_3/DisplayArtistName, "/") }   { 
>> $sound_recording_3/DisplayTitleText/text() }   { 
>> DisplayTitleText/text() }   { ReleaseId/ICPN/text() }  
>>  { $party_2/PartyName/FullName/text() }  
>>
>> Query:
>> declare namespace ernm="http://ddex.net/xml/ern/411;; for $r in 
>> /ernm:NewReleaseMessage for $track_release in $r/ReleaseList/TrackRelease 
>> for $party in $r/PartyList/Party for $sound_recording in 
>> $r/ResourceList/SoundRecording for $release in 

Re: [basex-talk] improving query performance

2020-08-21 Thread Bill Osmond
The indentation levels and multiple returns in your suggested query alone
has helped me in understanding what's actually going on, and I've rewritten
the query mostly as you suggest:

for $r in /ernm:NewReleaseMessage
  for $track_release in $r/ReleaseList/TrackRelease
  where $track_release/ReleaseLabelReference =
$r/PartyList/Party/PartyReference
  return
for $party in $r/PartyList/Party
  for $sound_recording in $r/ResourceList/SoundRecording
  where $track_release/ReleaseResourceReference =
$sound_recording/ResourceReference
  return
for $release in $r/ReleaseList/Release
where $track_release/ReleaseResourceReference =
$release/ResourceGroup/ResourceGroup/ResourceGroupContentItem/ReleaseResourceReference
return
  
{ $track_release/ReleaseId/ISRC/text() }
{ fn:string-join($sound_recording/DisplayArtistName, '/')
}
{ $sound_recording/DisplayTitleText/text() }
{ $release/DisplayTitleText/text() }
{ $release/ReleaseId/ICPN/text() }
{ $party/PartyName/FullName/text() }
  

I say "mostly" because at the first 'where' statement, I had to replace the
$party/PartyReference on the right side of the = with an explicit reference
from $r, because $party isn't defined until 2 lines further down. The query
does work and appears to be giving the right results, however the
performance is basically the same: 127,698.27 ms for this run vs.
114,935.22 ms for the original. I'll try shuffling things around next!

On Fri, Aug 21, 2020 at 1:52 PM Bill Osmond  wrote:

> Thank you both! I was in the middle of typing a response to Bridger when
> this came through, so while I digest both of your messages & test index
> addition/moving things around, I'll leave the output from the GUI info
> panel here - if I knew how to read it, it would no doubt point straight at
> the issue! I'll note that it runs much quicker locally as opposed to on
> server+via the python BaseXClient, but I don't know if that's because the
> GUI is stopping at 500k results, or what.
>
> Compiling:
> - rewrite context value to document-node() sequence: . ->
> (db:open-pre("umg-118061851424", 0), ...)
> - rewrite util:root(nodes) to document-node() sequence:
> util:root((db:open-pre("umg-118061851424", 0), ...)) ->
> (db:open-pre("umg-118061851424", 0), ...)
> - move where clause: $track_release_1/ReleaseLabelReference =
> $party_2/PartyReference
> - swap operands: PartyReference = $track_release_1/ReleaseLabelReference
> - rewrite to predicate: PartyReference =
> $track_release_1/ReleaseLabelReference
> - move where clause: $track_release_1/ReleaseResourceReference =
> $sound_recording_3/ResourceReference
> - swap operands: ResourceReference =
> $track_release_1/ReleaseResourceReference
> - rewrite to predicate: ResourceReference =
> $track_release_1/ReleaseResourceReference
> - swap operands:
> ResourceGroup/ResourceGroup/ResourceGroupContentItem/ReleaseResourceReference
> = $track_release_1/Rel...
> - rewrite to predicate:
> ResourceGroup/ResourceGroup/ResourceGroupContentItem/ReleaseResourceReference
> = $track_release_1/Rel...
> - inline for $release_4 in
> $r_0/ReleaseList/Release[ResourceGroup/ResourceGroup/ResourceGroupContentItem/Rele...
>
> Optimized Query:
> for $r_0 in (db:open-pre("umg-118061851424", 0),
> ...)/ernm:NewReleaseMessage for $track_release_1 in
> $r_0/ReleaseList/TrackRelease for $party_2 in
> $r_0/PartyList/Party[PartyReference =
> $track_release_1/ReleaseLabelReference] for $sound_recording_3 in
> $r_0/ResourceList/SoundRecording[ResourceReference =
> $track_release_1/ReleaseResourceReference] return
> $r_0/ReleaseList/Release[ResourceGroup/ResourceGroup/ResourceGroupContentItem/ReleaseResourceReference
> = $track_release_1/ReleaseResourceReference] !   {
> $track_release_1/ReleaseId/ISRC/text() }   {
> string-join($sound_recording_3/DisplayArtistName, "/") }  
> { $sound_recording_3/DisplayTitleText/text() }   {
> DisplayTitleText/text() }   { ReleaseId/ICPN/text() } 
>  { $party_2/PartyName/FullName/text() }  
>
> Query:
> declare namespace ernm="http://ddex.net/xml/ern/411;; for $r in
> /ernm:NewReleaseMessage for $track_release in $r/ReleaseList/TrackRelease
> for $party in $r/PartyList/Party for $sound_recording in
> $r/ResourceList/SoundRecording for $release in $r/ReleaseList/Release where
> $track_release/ReleaseLabelReference = $party/PartyReference and
> $track_release/ReleaseResourceReference =
> $sound_recording/ResourceReference and
> $track_release/ReleaseResourceReference =
> $release/ResourceGroup/ResourceGroup/ResourceGroupContentItem/ReleaseResourceReference
> return  { $track_release/ReleaseId/ISRC/text() }
> { fn:string-join($sound_recording/DisplayArtistName, '/')
> } { $sound_recording/DisplayTitleText/text() }
> { $release/DisplayTitleText/text() } {
> $release/ReleaseId/ICPN/text() } {
> $party/PartyName/FullName/text() } 
>
> Result:
> - Hit(s): 50 Items
> - Updated: 0 Items
> - Printed: 8192 kB
> - Read Locking: umg-118061851424
> - Write 

Re: [basex-talk] improving query performance

2020-08-21 Thread Bill Osmond
Thank you both! I was in the middle of typing a response to Bridger when
this came through, so while I digest both of your messages & test index
addition/moving things around, I'll leave the output from the GUI info
panel here - if I knew how to read it, it would no doubt point straight at
the issue! I'll note that it runs much quicker locally as opposed to on
server+via the python BaseXClient, but I don't know if that's because the
GUI is stopping at 500k results, or what.

Compiling:
- rewrite context value to document-node() sequence: . ->
(db:open-pre("umg-118061851424", 0), ...)
- rewrite util:root(nodes) to document-node() sequence:
util:root((db:open-pre("umg-118061851424", 0), ...)) ->
(db:open-pre("umg-118061851424", 0), ...)
- move where clause: $track_release_1/ReleaseLabelReference =
$party_2/PartyReference
- swap operands: PartyReference = $track_release_1/ReleaseLabelReference
- rewrite to predicate: PartyReference =
$track_release_1/ReleaseLabelReference
- move where clause: $track_release_1/ReleaseResourceReference =
$sound_recording_3/ResourceReference
- swap operands: ResourceReference =
$track_release_1/ReleaseResourceReference
- rewrite to predicate: ResourceReference =
$track_release_1/ReleaseResourceReference
- swap operands:
ResourceGroup/ResourceGroup/ResourceGroupContentItem/ReleaseResourceReference
= $track_release_1/Rel...
- rewrite to predicate:
ResourceGroup/ResourceGroup/ResourceGroupContentItem/ReleaseResourceReference
= $track_release_1/Rel...
- inline for $release_4 in
$r_0/ReleaseList/Release[ResourceGroup/ResourceGroup/ResourceGroupContentItem/Rele...

Optimized Query:
for $r_0 in (db:open-pre("umg-118061851424", 0),
...)/ernm:NewReleaseMessage for $track_release_1 in
$r_0/ReleaseList/TrackRelease for $party_2 in
$r_0/PartyList/Party[PartyReference =
$track_release_1/ReleaseLabelReference] for $sound_recording_3 in
$r_0/ResourceList/SoundRecording[ResourceReference =
$track_release_1/ReleaseResourceReference] return
$r_0/ReleaseList/Release[ResourceGroup/ResourceGroup/ResourceGroupContentItem/ReleaseResourceReference
= $track_release_1/ReleaseResourceReference] !   {
$track_release_1/ReleaseId/ISRC/text() }   {
string-join($sound_recording_3/DisplayArtistName, "/") }  
{ $sound_recording_3/DisplayTitleText/text() }   {
DisplayTitleText/text() }   { ReleaseId/ICPN/text() } 
 { $party_2/PartyName/FullName/text() }  

Query:
declare namespace ernm="http://ddex.net/xml/ern/411;; for $r in
/ernm:NewReleaseMessage for $track_release in $r/ReleaseList/TrackRelease
for $party in $r/PartyList/Party for $sound_recording in
$r/ResourceList/SoundRecording for $release in $r/ReleaseList/Release where
$track_release/ReleaseLabelReference = $party/PartyReference and
$track_release/ReleaseResourceReference =
$sound_recording/ResourceReference and
$track_release/ReleaseResourceReference =
$release/ResourceGroup/ResourceGroup/ResourceGroupContentItem/ReleaseResourceReference
return  { $track_release/ReleaseId/ISRC/text() }
{ fn:string-join($sound_recording/DisplayArtistName, '/')
} { $sound_recording/DisplayTitleText/text() }
{ $release/DisplayTitleText/text() } {
$release/ReleaseId/ICPN/text() } {
$party/PartyName/FullName/text() } 

Result:
- Hit(s): 50 Items
- Updated: 0 Items
- Printed: 8192 kB
- Read Locking: umg-118061851424
- Write Locking: (none)

Timing:
- Parsing: 2.88 ms
- Compiling: 4.38 ms
- Evaluating: 114935.22 ms
- Printing: 114.59 ms
- Total Time: 115057.07 ms

Query Plan:

  

  

  
  
  
  
  


  


  



  


  



  

  


  
  

  

  


  



  

  


  
  

  

  


  



  

  
  
  
  


  
  

  

  
  
identity

  isrc
  




  


  artist
  

  
  

/
  


  title
  



  


  album
  


  


  icpn
  



  


  sublabel
  




  

  

  


On Fri, Aug 21, 2020 at 1:48 PM Liam R. E. Quin 
wrote:

> On Fri, 2020-08-21 at 12:51 -0700, Bill Osmond wrote:
> >
> > 

Re: [basex-talk] improving query performance

2020-08-21 Thread Liam R. E. Quin
On Fri, 2020-08-21 at 12:51 -0700, Bill Osmond wrote:
> 
> declare namespace ernm="http://ddex.net/xml/ern/411;;
> 
> for $r in /ernm:NewReleaseMessage
> for $track_release in $r/ReleaseList/TrackRelease
> for $party in $r/PartyList/Party
> for $sound_recording in $r/ResourceList/SoundRecording
> for $release in $r/ReleaseList/Release
> where
>   $track_release/ReleaseLabelReference = $party/PartyReference
>   and $track_release/ReleaseResourceReference =
> $sound_recording/ResourceReference
>   and $track_release/ReleaseResourceReference =
> $release/ResourceGroup/ResourceGroup/ResourceGroupContentItem/Release
> ResourceReference

BaseX is probably smart enough to rewrite this, but check -

for $r in /ernm:NewReleaseMessage
 for $track_release in $r/ReleaseList/TrackRelease
 where $track_release/ReleaseLabelReference = $party/PartyReference

 return
   for $party in $r/PartyList/Party
for $sound_recording in $r/ResourceList/SoundRecording
where $track_release/ReleaseResourceReference =
  $sound_recording/ResourceReference
return
  for $release in $r/ReleaseList/Release
  where
$track_release/ReleaseResourceReference =
$release/ResourceGroup/ResourceGroup/ResourceGroupContentItem/ReleaseRe
sourceReference 
return
   ...

> Am I wrong, and would an additional value index help here? Or is my
> query
> just bad?

You're computing every possible combiation of 5 items and then
filtering out the ones you want.

Filtering out earlier would probably help. Also, moving the tests least
lilely to match to the outside would reduce the number of tests sooner.

A value index might well help, but as Bridger wrote, check in the gUI
to see the query plan. BaseX might already be doing the sort of rewrite
i suggested.

Liam


-- 
Liam Quin, https://www.delightfulcomputing.com/
Available for XML/Document/Information Architecture/XSLT/
XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
Barefoot Web-slave, antique illustrations:  http://www.fromoldbooks.org



Re: [basex-talk] improving query performance

2020-08-21 Thread Bridger Dyson-Smith
Hi Bill,

I won't claim to be any kind of query expert, but there are some things you
can try to experiment with query speeds:

On Fri, Aug 21, 2020 at 3:52 PM Bill Osmond  wrote:

> Hi all,
> I'm attempting to query a fairly large database, with 136,938 resources
> and a size of 12,257,686,099. The basex server itself is an AWS EC2
> instance with 4 cores & 16 gigs of ram, using -Xmx12g. The database
> contains audio DDEX information which (to me at least) is fairly
> complicated XML. The query I'm running against the database works and gives
> me the result I want, but takes upwards of 15 minutes to execute. I've
> taken a look at the wiki documentation regarding indexes, but it's not
> clear to me that any of the non-default indexes would help for the query
> I'm running. That query is:
>
> declare namespace ernm="http://ddex.net/xml/ern/411;;
>
> for $r in /ernm:NewReleaseMessage
> for $track_release in $r/ReleaseList/TrackRelease
> for $party in $r/PartyList/Party
> for $sound_recording in $r/ResourceList/SoundRecording
> for $release in $r/ReleaseList/Release
> where
>   $track_release/ReleaseLabelReference = $party/PartyReference
>   and $track_release/ReleaseResourceReference =
> $sound_recording/ResourceReference
>   and $track_release/ReleaseResourceReference =
> $release/ResourceGroup/ResourceGroup/ResourceGroupContentItem/ReleaseResourceReference
> return
>   
> { $track_release/ReleaseId/ISRC/text() }
> { fn:string-join($sound_recording/DisplayArtistName, '/')
> }
> { $sound_recording/DisplayTitleText/text() }
> { $release/DisplayTitleText/text() }
> { $release/ReleaseId/ICPN/text() }
> { $party/PartyName/FullName/text() }
>   
>
> Am I wrong, and would an additional value index help here? Or is my query
> just bad?
>
> It looks like you're trying to use text values -- do you have a full range
of indexes in your database? (text, attribute, token, and maybe full text)

Also, have you tried running this query against a local instance of the
database? The GUI can help steer you towards query optimizations in the
Info window.


> Thanks,
> Bill
>
> p.s. The BaseX GUI client is awesome - especially the treemap view. Really
> helps with wrangling these XML files!
>

I'm crunched for time at the moment so please excuse the hasty response.

Best,
Bridger


[basex-talk] improving query performance

2020-08-21 Thread Bill Osmond
Hi all,
I'm attempting to query a fairly large database, with 136,938 resources and
a size of 12,257,686,099. The basex server itself is an AWS EC2 instance
with 4 cores & 16 gigs of ram, using -Xmx12g. The database contains audio
DDEX information which (to me at least) is fairly complicated XML. The
query I'm running against the database works and gives me the result I
want, but takes upwards of 15 minutes to execute. I've taken a look at the
wiki documentation regarding indexes, but it's not clear to me that any of
the non-default indexes would help for the query I'm running. That query is:

declare namespace ernm="http://ddex.net/xml/ern/411;;

for $r in /ernm:NewReleaseMessage
for $track_release in $r/ReleaseList/TrackRelease
for $party in $r/PartyList/Party
for $sound_recording in $r/ResourceList/SoundRecording
for $release in $r/ReleaseList/Release
where
  $track_release/ReleaseLabelReference = $party/PartyReference
  and $track_release/ReleaseResourceReference =
$sound_recording/ResourceReference
  and $track_release/ReleaseResourceReference =
$release/ResourceGroup/ResourceGroup/ResourceGroupContentItem/ReleaseResourceReference
return
  
{ $track_release/ReleaseId/ISRC/text() }
{ fn:string-join($sound_recording/DisplayArtistName, '/')
}
{ $sound_recording/DisplayTitleText/text() }
{ $release/DisplayTitleText/text() }
{ $release/ReleaseId/ICPN/text() }
{ $party/PartyName/FullName/text() }
  

Am I wrong, and would an additional value index help here? Or is my query
just bad?

Thanks,
Bill

p.s. The BaseX GUI client is awesome - especially the treemap view. Really
helps with wrangling these XML files!