Re: Query with spatial and text searches.

Andy Seaborne Wed, 23 Dec 2015 09:04:19 -0800

Hi Mark,

Tricky.

There isn't a good way to turn off or modify optimization for parts of a query without affecting the whole query. Jena 3.0.1 had a combination of changes: hash joins, but also stronger flattening of queries into the form you don't want for the first part.


The best I have come up with is:
(no special flags needed)


SELECT ?score ?ent
WHERE {
  { SELECT ?ent { ?ent spatial:nearby "ABC" . } OFFSET 0 }
  { SELECT ?ent { ?ent  text:query "DEF" . }  OFFSET 0 }
   ... rest of query ...

  }

i.e.

SELECT  ?score ?ent ?entLabel ?lat ?long ?point ?pointType ?pointLabel
WHERE {
    { SELECT ?ent {

        ?ent spatial:nearby(51.507999420166016 -0.10999999940395355 70.8018078804016 'km') .

         } OFFSET 0 }
    { SELECT ?ent ?score {
        (?ent ?score) text:query ('environment' 'lang:en') .
        } OFFSET 0 }

    ?ent rdf:type iotic:Entity

OPTIONAL {
    ?ent rdfs:label ?entLabel .
    FILTER langMatches( lang(?entLabel), 'en' ) .
    }

    OPTIONAL {?ent geo:lat ?lat . ?ent geo:long ?long}
    ?ent iotic:Advertises ?point .
    ?point rdf:type iotic:Point .
    ?point iotic:PointType ?pointType .

OPTIONAL {
    ?point rdfs:label ?pointLabel .
    FILTER langMatches( lang(?pointLabel), 'en' ) .
    }
}
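
The OFFSET 0 wrapping used in the query above is mechanical, so it can be generated rather than typed by hand. The following is a minimal Python sketch of that idea, not part of Jena or Fuseki: the helper names `isolate` and `build_search_query` are hypothetical, and the code only builds a query string. It assumes the caller supplies valid SPARQL fragments and any PREFIX declarations.

```python
def isolate(variables: str, pattern: str) -> str:
    """Wrap a graph pattern in a sub-SELECT with OFFSET 0.

    `variables` is the projection of the sub-SELECT, e.g. "?ent" or
    "?ent ?score". Only projected variables are visible outside it.
    """
    return f"{{ SELECT {variables} {{ {pattern} }} OFFSET 0 }}"


def build_search_query(projection: str, isolated: list, rest: str) -> str:
    """Assemble a query whose index lookups are each kept in a sub-SELECT.

    `isolated` is a list of (variables, pattern) pairs; `rest` is the
    remainder of the WHERE clause, passed through unchanged.
    """
    lines = [f"SELECT {projection}", "WHERE {"]
    for variables, pattern in isolated:
        lines.append("  " + isolate(variables, pattern))
    lines.append("  " + rest)
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_search_query(
        "?score ?ent",
        [
            ("?ent", '?ent spatial:nearby "ABC" .'),
            ("?ent ?score", '(?ent ?score) text:query "DEF" .'),
        ],
        "?ent rdf:type iotic:Entity .",
    ))
```

Note that the second pair projects ?score as well as ?ent, since a variable that the sub-SELECT does not project is not bound in the outer query.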


On 23/12/15 11:03, Mark Wharton wrote:

Hi Andy.

More experiments this morning.  I originally only sent you a small part
of a larger query just to expose the problem in its simplest form.  And
your switches work well in that case (i.e. first formulation below
*with* the comments.)

But... There's a problem when using the switches in that the rest of the
query wants to get the rdfs:label and various other properties.  This
destroys the performance gains.

I've tried "yours" and "mine" with and without the switches and then the
separate parts on their own to see how that goes.

1) "yours"
==========
This formulation (with the switches and comments in place) - 384 ms

SELECT  ?score ?ent ?entLabel ?lat ?long ?point ?pointType ?pointLabel
WHERE {
    { ?ent spatial:nearby(51.507999420166016 -0.10999999940395355 70.8018078804016 'km') }
    { (?ent ?score) text:query ('environment' 'lang:en') . FILTER EXISTS {?ent rdf:type iotic:Entity} }

#    OPTIONAL {
#        ?ent rdfs:label ?entLabel .
#        FILTER langMatches( lang(?entLabel), 'en' ) .
#        }
#
#    OPTIONAL {?ent geo:lat ?lat . ?ent geo:long ?long}
#    ?ent iotic:Advertises ?point .
#    ?point rdf:type iotic:Point .
#    ?point iotic:PointType ?pointType .
#
#    OPTIONAL {
#       ?point rdfs:label ?pointLabel .
#       FILTER langMatches( lang(?pointLabel), 'en' ) .
#       }

}

Uncomment the lines and the performance drops to - 7.165 ms

2) "mine"
=========
The below formulation with the switches in place 11.221 secs
The below without the switches. 5.371 secs

SELECT  ?score ?ent ?entLabel ?lat ?long ?point ?pointType ?pointLabel
WHERE {
     ?ent spatial:nearby(51.507999420166016 -0.10999999940395355 70.8018078804016 'km') .
     (?ent ?score) text:query ('environment' 'lang:en') . FILTER EXISTS {?ent rdf:type iotic:Entity} .

OPTIONAL {
     ?ent rdfs:label ?entLabel .
     FILTER langMatches( lang(?entLabel), 'en' ) .
     }

     OPTIONAL {?ent geo:lat ?lat . ?ent geo:long ?long}
     ?ent iotic:Advertises ?point .
     ?point rdf:type iotic:Point .
     ?point iotic:PointType ?pointType .

OPTIONAL {
     ?point rdfs:label ?pointLabel .
     FILTER langMatches( lang(?pointLabel), 'en' ) .
     }

}

3) Separately
==============
Completely on their own:
========================
i.e. just the ?ent spatial:nearby line
the spatial query on its own takes 50 ms
i.e just the text:query line
and the text on its own takes 258 ms

With the OPTIONAL {} and other properties
=========================================
Spatial and other properties 135 ms
Text and other properties 854 ms

Again, repeated thanks for your help.

Mark

Technology Lead, Iotic Labs
[email protected]
https://www.iotic-labs.com

On 22/12/15 17:22, Andy Seaborne wrote:

Mark,

Thanks for the experiment results.

On 22/12/15 15:47, Mark Wharton wrote:

Query below run without Andy's switches.
   INFO  [5] 200 OK (4.985 s)

Query below run with Andy's switches.
   INFO  [1] 200 OK (840 ms)

Them's some magic switches.  Thanks, Andy.

Do they have any impact (negative or positive) on any other SPARQL
operations?  I'm only curious as you've solved our main problem in that
our "search" query was very slow.  There's nowhere else that uses the
text and spatial indexes for retrieval.


This depends on an internal change in the latest release (Jena 3.0.1,
Fuseki 2.3.1). Prior to that it will not make the same difference.
Specifically, unoptimized joins are now hash joins.

But that is not a good choice for the "?ent rdf:type iotic:Entity"
triple pattern.  The system can't distinguish different cases involving
external indexes as it knows not very much about the index details.

Adding

FILTER EXISTS { ?ent rdf:type iotic:Entity }

might work because the triple pattern is really a check, not a match
setting a variable.

A plain "?ent rdf:type iotic:Entity" will retrieve all things of that
class regardless of spatial and text query when those optimizations are off.
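
The difference between a pattern that matches and a pattern that only checks can be shown with a toy model. This is a hedged Python sketch, not Jena code: the in-memory triple list, the function names and the sample data are all hypothetical, and it only counts rows rather than modelling the real evaluator.

```python
TYPE = "rdf:type"


def plain_pattern_rows(triples, cls):
    """Evaluate '?ent rdf:type cls' on its own: every member of the class."""
    return [s for (s, p, o) in triples if p == TYPE and o == cls]


def filter_exists(candidates, triples, cls):
    """Keep only candidates that have the type triple; binds nothing new."""
    typed = {(s, o) for (s, p, o) in triples if p == TYPE}
    return [ent for ent in candidates if (ent, cls) in typed]


def demo():
    # 1000 entities in the class, but only a handful of search candidates.
    triples = [(f"ent{i}", TYPE, "iotic:Entity") for i in range(1000)]
    triples.append(("ent5", "rdfs:label", "five"))
    candidates = ["ent5", "ent7", "other"]
    rows = plain_pattern_rows(triples, "iotic:Entity")
    kept = filter_exists(candidates, triples, "iotic:Entity")
    return len(rows), kept


if __name__ == "__main__":
    print(demo())
```

In this model the plain pattern produces one row per member of the class before any join happens, while the EXISTS-style check only ever looks at the candidates that the spatial and text parts already produced.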

     Andy


Many thanks for this help so close to the holiday season.  Happy
holidays to you all at Jena - keep up the good work.

Mark


Technology Lead, Iotic Labs
+44 7973 674404
[email protected]
https://www.iotic-labs.com

On 22/12/15 11:49, Andy Seaborne wrote:

Mark - here is another way.

This query:

SELECT ?score ?ent
WHERE {
     { ?ent spatial:nearby ( .... ) }
     { ?ent text:query ( ..... ) }
     # No ?ent rdf:type iotic:Entity .
     # This focuses the query on the presenting issue.
}

and then run Fuseki with the following flags:

    --set arq:optIndexJoinStrategy=false --set arq:optMergeBGPs=false

for however you are running the server.

You need both --set options.

The service script will not do this very easily - if environment
variable FUSEKI_ARGS is set it might do. Untested.

It is easier to run the server standalone:

(Linux, Mac)

The "fuseki-server" script should pass these in:

fuseki-server \
    --set arq:optIndexJoinStrategy=false --set arq:optMergeBGPs=false \
    .. other args ..

(Windows or any platform)

You can call the server java code directly: all one line:


java -Xmx1200M -jar fuseki-server.jar --set arq:optIndexJoinStrategy=false --set arq:optMergeBGPs=false .. other args ..

You'll need to give the full path name of fuseki-server.jar.

Sorry - I don't have your setup to test this fully. I did make sure that
the reworked query does lead to an execution plan that is different and
should yield some information about the situation.

      Andy

On 22/12/15 09:50, Andy Seaborne wrote:

On 22/12/15 07:06, Mark Wharton wrote:

Ah, wheels within wheels.

The formulation with the filter in it is fine, except that if you want
to search for more than one word, or you match in label and comment,
then the UNION formulation returns you duplicate rows.  This isn't a
problem with the Lucene search, which is why (I now remember) I used it
in the first place.

I'm not sure what version of jena I'm using - I just use the fuseki
release at 2.3.0.  Is there a way to find out?


3.0.0

Many of the java commands support --version and the fuseki-server jar
is an all-in-one jar:

java -cp <YourInstall>/fuseki-server.jar arq.sparql --version

What's the status on the JENA-999 and JENA-1093 issues?  I see there's
been some activity on 999 in the last few days. Andy Seaborne's last
comment seems encouraging.

I don't want to adopt a single version as I'll be stuck forever patching
back and forward and it will break eventually.

Many thanks for your continued help.


JENA-999 may sort of help but I'm not that positive because each ?ent
from the first part will be different going into the second part.  It
looks to me as if it is the overhead of going out to Lucene. (This is
Lucene right? not Solr?)
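
The overhead described above can be illustrated with a toy model. This Python sketch is not Jena code and makes strong simplifying assumptions: `FakeIndex` and its `lookup` method are hypothetical stand-ins for an external (Lucene-like) index, and each call to `lookup` represents one trip out to that index. It compares an index-style join, which goes back to the second index for every ?ent from the first, with a hash join, which evaluates each side once.

```python
class FakeIndex:
    """Stand-in for an external index that counts how often it is called."""

    def __init__(self, hits):
        self.hits = set(hits)
        self.calls = 0

    def lookup(self):
        """Return all matching entities; counts as one external call."""
        self.calls += 1
        return set(self.hits)


def index_join(left, right):
    """For each left binding, go out to the right index again and check it."""
    result = []
    for ent in sorted(left.lookup()):
        if ent in right.lookup():  # one external call per left binding
            result.append(ent)
    return result


def hash_join(left, right):
    """Evaluate each side once, then intersect the results in memory."""
    table = left.lookup()
    return sorted(ent for ent in right.lookup() if ent in table)


def demo():
    spatial_hits = [f"ent{i}" for i in range(100)]
    text_hits = [f"ent{i}" for i in range(90, 120)]
    s1, t1 = FakeIndex(spatial_hits), FakeIndex(text_hits)
    s2, t2 = FakeIndex(spatial_hits), FakeIndex(text_hits)
    via_index = index_join(s1, t1)
    via_hash = hash_join(s2, t2)
    return via_index, via_hash, t1.calls, t2.calls


if __name__ == "__main__":
    via_index, via_hash, index_calls, hash_calls = demo()
    print(len(via_index), len(via_hash), index_calls, hash_calls)
```

Both strategies return the same answers in this model; what differs is how often the second index is consulted, which grows with the number of results from the first part under the index-style join.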

The ideal is some super compilation of the text:query and spatial query
into one big Lucene query.

What would also be good is a way to stop the general optimizer (this has
nothing to do with TDB) using an index join.  Except that an index join is
the better choice for the rdf:type pattern.  This is what the additional {}
were trying for, except the optimizer outsmarted them.

SELECT ?score ?ent
WHERE {
    ?ent spatial:nearby( ...) .
    (?ent ?score) text:query (...) .
    ?ent rdf:type iotic:Entity .
}



Mark - can you ask the query from Java?  If so,

Add "Optimize.noOptimizer();" before executing the query.  I can't see
a way to do that from setting the environment for Fuseki.

Or (the effect on time of this is version specific and whether it does
anything useful is a big "maybe") you could try this:

SELECT ?score ?ent
WHERE {
    { OPTIONAL { ?ent spatial:nearby "ABC" . }}
    { OPTIONAL { ?ent  text:query "DEF" } }
}

       Andy
