Hi Andy

Thanks for your help.  I've integrated those changes into the main
search query and it's about 10x faster.

Just one last question...  What does the OFFSET 0 bit do in the query -
force it to use the spatial/text index in preference to any other?

Mark

Technology Lead, Iotic Labs
+44 7973 674404
[email protected]
https://www.iotic-labs.com

On 24/12/15 05:38, Mark Wharton wrote:
> Hi Andy.
> 
> That's cracked it.  I was wondering about the sub-select route, but
> wasn't sure how to code the intersection part.  I just tweaked it to
> return the score from the text query
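>
> A sketch of that tweak, assuming it amounts to projecting ?score out of
> the text sub-select so that the outer SELECT can see it (the rest of the
> query is left out here):
>
> ```sparql
> PREFIX spatial: <http://jena.apache.org/spatial#>
> PREFIX text:    <http://jena.apache.org/text#>
>
> SELECT ?score ?ent
> WHERE {
>     { SELECT ?ent {
>         ?ent spatial:nearby (51.507999420166016 -0.10999999940395355 70.8018078804016 'km') .
>       } OFFSET 0 }
>     # ?score is added to the projection here; a variable that a
>     # sub-select does not project is not visible to the outer query.
>     { SELECT ?ent ?score {
>         (?ent ?score) text:query ('environment' 'lang:en') .
>       } OFFSET 0 }
> }
> ```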
> 
> Your formulation
> 200 OK (231 ms)
> 
> That's 200 OK by me...
> 
> Enjoy the holidays
> 
> Mark
> 
> Technology Lead, Iotic Labs
> +44 7973 674404
> [email protected]
> https://www.iotic-labs.com
> 
> On 23/12/15 17:03, Andy Seaborne wrote:
>> Hi Mark,
>>
>> Tricky.
>>
>> There isn't a good way to turn off or modify optimization for parts of a
>> query without affecting the whole query.  Jena 3.0.1 had a combination
>> of changes - hash join but also stronger flattening of queries into the
>> form you don't want for the first part.
>>
>> The best I have come up with is:
>> (no special flags needed)
>>
>>
>> SELECT ?score ?ent
>> WHERE {
>>   { SELECT ?ent { ?ent spatial:nearby "ABC" . } OFFSET 0 }
>>   { SELECT ?ent { ?ent  text:query "DEF" . }  OFFSET 0 }
>>    ... rest of query ...
>>
>>   }
>>
>> i.e.
>>
>> SELECT  ?score ?ent ?entLabel ?lat ?long ?point ?pointType ?pointLabel
>> WHERE {
>>     { SELECT ?ent {
>>         ?ent spatial:nearby (51.507999420166016 -0.10999999940395355 70.8018078804016 'km') .
>>          } OFFSET 0 }
>>     { SELECT ?ent {
>>         (?ent ?score) text:query ('environment' 'lang:en') .
>>         } OFFSET 0 }
>>
>>     ?ent rdf:type iotic:Entity
>>
>> OPTIONAL {
>>     ?ent rdfs:label ?entLabel .
>>     FILTER langMatches( lang(?entLabel), 'en' ) .
>>     }
>>
>>     OPTIONAL {?ent geo:lat ?lat . ?ent geo:long ?long}
>>     ?ent iotic:Advertises ?point .
>>     ?point rdf:type iotic:Point .
>>     ?point iotic:PointType ?pointType .
>>
>> OPTIONAL {
>>     ?point rdfs:label ?pointLabel .
>>     FILTER langMatches( lang(?pointLabel), 'en' ) .
>>     }
>> }
>>
>>
>> On 23/12/15 11:03, Mark Wharton wrote:
>>> Hi Andy.
>>>
>>> More experiments this morning.  I originally only sent you a small part
>>> of a larger query just to expose the problem in its simplest form.  And
>>> your switches work well in that case (i.e. first formulation below
>>> *with* the comments.)
>>>
>>> But... There's a problem when using the switches in that the rest of the
>>> query wants to get the rdfs:label and various other properties.  This
>>> destroys the performance gains.
>>>
>>> I've tried "yours" and "mine" with and without the switches and then the
>>> separate parts on their own to see how that goes.
>>>
>>> 1) "yours"
>>> ==========
>>> This formulation (with the switches and comments in place) - 384 ms
>>>
>>> SELECT  ?score ?ent ?entLabel ?lat ?long ?point ?pointType ?pointLabel
>>> WHERE {
>>>     { ?ent spatial:nearby (51.507999420166016 -0.10999999940395355 70.8018078804016 'km') }
>>>     { (?ent ?score) text:query ('environment' 'lang:en') .
>>>       FILTER EXISTS { ?ent rdf:type iotic:Entity } }
>>>
>>> #    OPTIONAL {
>>> #        ?ent rdfs:label ?entLabel .
>>> #        FILTER langMatches( lang(?entLabel), 'en' ) .
>>> #        }
>>> #
>>> #    OPTIONAL {?ent geo:lat ?lat . ?ent geo:long ?long}
>>> #    ?ent iotic:Advertises ?point .
>>> #    ?point rdf:type iotic:Point .
>>> #    ?point iotic:PointType ?pointType .
>>> #
>>> #    OPTIONAL {
>>> #       ?point rdfs:label ?pointLabel .
>>> #       FILTER langMatches( lang(?pointLabel), 'en' ) .
>>> #       }
>>>
>>> }
>>>
>>> Uncomment the lines and the performance drops to - 7.165 secs
>>>
>>> 2) "mine"
>>> =========
>>> The below formulation with the switches in place 11.221 secs
>>> The below without the switches. 5.371 secs
>>>
>>> SELECT  ?score ?ent ?entLabel ?lat ?long ?point ?pointType ?pointLabel
>>> WHERE {
>>>      ?ent spatial:nearby (51.507999420166016 -0.10999999940395355 70.8018078804016 'km') .
>>>      (?ent ?score) text:query ('environment' 'lang:en') .
>>>      FILTER EXISTS { ?ent rdf:type iotic:Entity } .
>>>
>>> OPTIONAL {
>>>      ?ent rdfs:label ?entLabel .
>>>      FILTER langMatches( lang(?entLabel), 'en' ) .
>>>      }
>>>
>>>      OPTIONAL {?ent geo:lat ?lat . ?ent geo:long ?long}
>>>      ?ent iotic:Advertises ?point .
>>>      ?point rdf:type iotic:Point .
>>>      ?point iotic:PointType ?pointType .
>>>
>>> OPTIONAL {
>>>      ?point rdfs:label ?pointLabel .
>>>      FILTER langMatches( lang(?pointLabel), 'en' ) .
>>>      }
>>>
>>> }
>>>
>>> 3) Separately
>>> ==============
>>> Completely on their own:
>>> ========================
>>> Spatial query on its own (just the ?ent spatial:nearby line): 50 ms
>>> Text query on its own (just the text:query line): 258 ms
>>>
>>> With the OPTIONAL {} and other properties
>>> =========================================
>>> Spatial and other properties 135 ms
>>> Text and other properties 854 ms
>>>
>>> Again, repeated thanks for your help.
>>>
>>> Mark
>>>
>>> Technology Lead, Iotic Labs
>>> [email protected]
>>> https://www.iotic-labs.com
>>>
>>> On 22/12/15 17:22, Andy Seaborne wrote:
>>>> Mark,
>>>>
>>>> Thanks for the experiment results.
>>>>
>>>> On 22/12/15 15:47, Mark Wharton wrote:
>>>>> Query below run without Andy's switches.
>>>>>    INFO  [5] 200 OK (4.985 s)
>>>>>
>>>>> Query below run with Andy's switches.
>>>>>    INFO  [1] 200 OK (840 ms)
>>>>>
>>>>> Them's some magic switches.  Thanks, Andy.
>>>>>
>>>>> Do they have any impact (negative or positive) on any other SPARQL
>>>>> operations?  I'm only curious as you've solved our main problem in that
>>>>> our "search" query was very slow.  There's nowhere else that uses the
>>>>> text and spatial indexes for retrieval.
>>>>
>>>> This depends on an internal change in the latest release (Jena 3.0.1,
>>>> Fuseki 2.3.1). Prior to that it will not make the same difference.
>>>> Specifically, unoptimized joins are now hash joins.
>>>>
>>>> But that is not a good choice for the "?ent rdf:type iotic:Entity"
>>>> triple pattern.  The system can't distinguish different cases involving
>>>> external indexes as it knows very little about the index details.
>>>>
>>>> Adding
>>>>
>>>> FILTER EXISTS { ?ent rdf:type iotic:Entity }
>>>>
>>>> might work because the triple pattern is really a check, not a match
>>>> setting a variable.
>>>>
>>>> A plain "?ent rdf:type iotic:Entity" will retrieve all things of that
>>>> class regardless of the spatial and text queries when those optimizations
>>>> are off.
>>>>
>>>>      Andy
>>>>
>>>>>
>>>>> Many thanks for this help so close to the holiday season.  Happy
>>>>> holidays to you all at Jena - keep up the good work.
>>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>> Technology Lead, Iotic Labs
>>>>> +44 7973 674404
>>>>> [email protected]
>>>>> https://www.iotic-labs.com
>>>>>
>>>>> On 22/12/15 11:49, Andy Seaborne wrote:
>>>>>> Mark - here is another way.
>>>>>>
>>>>>> This query:
>>>>>>
>>>>>> SELECT ?score ?ent
>>>>>> WHERE {
>>>>>>      { ?ent spatial:nearby ( .... ) }
>>>>>>      { ?ent text:query ( ..... ) }
>>>>>>      # No ?ent rdf:type iotic:Entity .
>>>>>>      # This focuses the query on the presenting issue.
>>>>>> }
>>>>>>
>>>>>> and then run Fuseki with the following flags:
>>>>>>
>>>>>>     --set arq:optIndexJoinStrategy=false --set arq:optMergeBGPs=false
>>>>>>
>>>>>> for however you are running the server.
>>>>>>
>>>>>> You need both --set
>>>>>>
>>>>>> The service script will not do this very easily - if environment
>>>>>> variable FUSEKI_ARGS is set it might do. Untested.
>>>>>>
>>>>>> It is easier to run the server standalone:
>>>>>>
>>>>>> (Linux, Mac)
>>>>>>
>>>>>> The "fuseki-server" script should pass these in:
>>>>>>
>>>>>> fuseki-server \
>>>>>>     --set arq:optIndexJoinStrategy=false \
>>>>>>     --set arq:optMergeBGPs=false \
>>>>>>     .. other args ..
>>>>>>
>>>>>> (Windows or any platform)
>>>>>>
>>>>>> You can call the server java code directly: all one line:
>>>>>>
>>>>>>
>>>>>> java -Xmx1200M -jar fuseki-server.jar --set arq:optIndexJoinStrategy=false --set arq:optMergeBGPs=false .. other args ..
>>>>>>
>>>>>> you'll need to put the full path name of fuseki-server.jar
>>>>>>
>>>>>> Sorry - I don't have your setup to test this fully. I did make sure that
>>>>>> the reworked query does lead to an execution plan that is different and
>>>>>> should yield some information about the situation.
>>>>>>
>>>>>>       Andy
>>>>>>
>>>>>> On 22/12/15 09:50, Andy Seaborne wrote:
>>>>>>> On 22/12/15 07:06, Mark Wharton wrote:
>>>>>>>> Ah, wheels within wheels.
>>>>>>>>
>>>>>>>> The formulation with the filter in it is fine, except that if you want
>>>>>>>> to search for more than one word, or you match in label and comment,
>>>>>>>> then the UNION formulation returns duplicate rows.  This isn't a
>>>>>>>> problem with the Lucene search, which is why (I now remember) I used
>>>>>>>> it in the first place.
>>>>>>>>
>>>>>>>> I'm not sure what version of jena I'm using - I just use the fuseki
>>>>>>>> release at 2.3.0.  Is there a way to find out?
>>>>>>>
>>>>>>> 3.0.0
>>>>>>>
>>>>>>> Many of the java commands support --version and the fuseki-server jar
>>>>>>> is an all-in-one jar:
>>>>>>>
>>>>>>> java -cp <YourInstall>/fuseki-server.jar arq.sparql --version
>>>>>>>
>>>>>>>> What's the status on the JENA-999 and JENA-1093 issues?  I see there's
>>>>>>>> been some activity on 999 in the last few days. Andy Seaborne's last
>>>>>>>> comment seems encouraging.
>>>>>>>>
>>>>>>>> I don't want to adopt a single version as I'll be stuck forever patching
>>>>>>>> back and forward and it will break eventually.
>>>>>>>>
>>>>>>>> Many thanks for your continued help.
>>>>>>>
>>>>>>> JENA-999 may sort of help but I'm not that positive because each ?ent
>>>>>>> from the first part will be different going into the second part.  It
>>>>>>> looks to me as if it is the overhead of going out to Lucene. (This is
>>>>>>> Lucene right? not Solr?)
>>>>>>>
>>>>>>> The ideal is some super compilation of the text:query and spatial query
>>>>>>> into one big Lucene query.
>>>>>>>
>>>>>>> What would also be good is to stop the general optimizer (this is
>>>>>>> nothing to do with TDB) using an index join.  Except that is the better
>>>>>>> choice for the rdf:type.  This is what the additional {} were trying for,
>>>>>>> except the optimizer outsmarted them.
>>>>>>>
>>>>>>> SELECT ?score ?ent
>>>>>>> WHERE {
>>>>>>>     ?ent spatial:nearby( ...) .
>>>>>>>     (?ent ?score) text:query (...) .
>>>>>>>     ?ent rdf:type iotic:Entity .
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Mark - can you ask the query from Java?  If so,
>>>>>>>
>>>>>>> Add  "Optimize.noOptimizer(); " before executing the query.  I can't see
>>>>>>> a way to do that from setting the environment for Fuseki.
>>>>>>>
>>>>>>> Or (the effect on time of this is version specific and whether it does
>>>>>>> anything useful is a big "maybe") you could try this:
>>>>>>>
>>>>>>> SELECT ?score ?ent
>>>>>>> WHERE {
>>>>>>>     { OPTIONAL { ?ent spatial:nearby "ABC" . }}
>>>>>>>     { OPTIONAL { ?ent  text:query "DEF" } }
>>>>>>> }
>>>>>>>
>>>>>>>        Andy
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>
