Hi Andy,

Thanks for your help. I've integrated those changes into the main search query and it's about 10x faster.
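
For anyone wanting to reproduce comparisons like this, here is a minimal Python sketch of timing a query from the client side. It is only an illustration: the endpoint URL, the dataset name `ds`, and the helper names `build_request` and `time_query` are assumptions, not taken from this thread. Client-side wall-clock times will also not match the figures in the Fuseki log exactly.

```python
import time
import urllib.parse
import urllib.request

# Assumed endpoint: a local Fuseki server with a dataset named "ds".
ENDPOINT = "http://localhost:3030/ds/query"


def build_request(endpoint, query):
    """Build a SPARQL Protocol POST request with a form-encoded 'query' parameter."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": "application/sparql-results+json",
    }
    # Supplying a body makes urllib send this as a POST.
    return urllib.request.Request(endpoint, data=body, headers=headers)


def time_query(endpoint, query, runs=3):
    """Return client-side wall-clock times in milliseconds, one per run."""
    timings = []
    for _ in range(runs):
        request = build_request(endpoint, query)
        start = time.perf_counter()
        with urllib.request.urlopen(request) as response:
            response.read()  # drain the body so the full result is included
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings
```

Calling `time_query(ENDPOINT, query)` with each formulation in turn, against a warmed-up server, should give a rough like-for-like comparison.
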
Just one last question... What does the OFFSET 0 bit do in the query - force it to use the spatial/text index in preference to any other?

Mark

Technology Lead, Iotic Labs
+44 7973 674404
[email protected]
https://www.iotic-labs.com

On 24/12/15 05:38, Mark Wharton wrote:
> Hi Andy.
>
> That's cracked it. I was wondering about the sub-select route, but
> wasn't sure how to code the intersection part. I just tweaked it to
> return the score from the text query
>
> Your formulation
> 200 OK (231 ms)
>
> That's 200 OK by me...
>
> Enjoy the holidays
>
> Mark
>
> Technology Lead, Iotic Labs
> +44 7973 674404
> [email protected]
> https://www.iotic-labs.com
>
> On 23/12/15 17:03, Andy Seaborne wrote:
>> Hi Mark,
>>
>> Tricky.
>>
>> There isn't a good way to turn off or modify optimization for parts of a
>> query without affecting the whole query. Jena 3.0.1 had a combination
>> of changes - hash join but also stronger flattening queries into the
>> form you don't want for the first part.
>>
>> The best I have come up with is:
>> (no special flags needed)
>>
>>
>> SELECT ?score ?ent
>> WHERE {
>>   { SELECT ?ent { ?ent spatial:nearby "ABC" . } OFFSET 0 }
>>   { SELECT ?ent { ?ent text:query "DEF" . } OFFSET 0 }
>>   ... rest of query ...
>>
>> }
>>
>> i.e.
>>
>> SELECT ?score ?ent ?entLabel ?lat ?long ?point ?pointType ?pointLabel
>> WHERE {
>>   { SELECT ?ent {
>>       ?ent spatial:nearby(51.507999420166016 -0.10999999940395355
>>                           70.8018078804016'km') .
>>   } OFFSET 0 }
>>   { SELECT ?ent {
>>       (?ent ?score) text:query ('environment' 'lang:en') .
>>   } OFFSET 0 }
>>
>>   ?ent rdf:type iotic:Entity
>>
>>   OPTIONAL {
>>     ?ent rdfs:label ?entLabel .
>>     FILTER langMatches( lang(?entLabel), 'en' ) .
>>   }
>>
>>   OPTIONAL {?ent geo:lat ?lat . ?ent geo:long ?long}
>>   ?ent iotic:Advertises ?point .
>>   ?point rdf:type iotic:Point .
>>   ?point iotic:PointType ?pointType .
>>
>>   OPTIONAL {
>>     ?point rdfs:label ?pointLabel .
>>     FILTER langMatches( lang(?pointLabel), 'en' ) .
>>   }
>> }
>>
>>
>> On 23/12/15 11:03, Mark Wharton wrote:
>>> Hi Andy.
>>>
>>> More experiments this morning. I originally only sent you a small part
>>> of a larger query just to expose the problem in its simplest form. And
>>> your switches work well in that case (i.e. first formulation below
>>> *with* the comments.)
>>>
>>> But... There's a problem when using the switches in that the rest of the
>>> query wants to get the rdfs:label and various other properties. This
>>> destroys the performance gains.
>>>
>>> I've tried "yours" and "mine" with and without the switches and then the
>>> separate parts on their own to see how that goes.
>>>
>>> 1) "yours"
>>> ==========
>>> This formulation (with the switches and comments in place) - 384 ms
>>>
>>> SELECT ?score ?ent ?entLabel ?lat ?long ?point ?pointType ?pointLabel
>>> WHERE {
>>>   { ?ent spatial:nearby(51.507999420166016 -0.10999999940395355
>>>                         70.8018078804016'km') }
>>>   { (?ent ?score) text:query ('environment' 'lang:en') .FILTER EXISTS
>>>     {?ent rdf:type iotic:Entity} }
>>>
>>> # OPTIONAL {
>>> # ?ent rdfs:label ?entLabel .
>>> # FILTER langMatches( lang(?entLabel), 'en' ) .
>>> # }
>>> #
>>> # OPTIONAL {?ent geo:lat ?lat . ?ent geo:long ?long}
>>> # ?ent iotic:Advertises ?point .
>>> # ?point rdf:type iotic:Point .
>>> # ?point iotic:PointType ?pointType .
>>> #
>>> # OPTIONAL {
>>> # ?point rdfs:label ?pointLabel .
>>> # FILTER langMatches( lang(?pointLabel), 'en' ) .
>>> # }
>>>
>>> }
>>>
>>> Uncomment the lines and the performance drops to - 7.165 ms
>>>
>>> 2) "mine"
>>> =========
>>> The below formulation with the switches in place 11.221 secs
>>> The below without the switches. 5.371 secs
>>>
>>> SELECT ?score ?ent ?entLabel ?lat ?long ?point ?pointType ?pointLabel
>>> WHERE {
>>>   ?ent spatial:nearby(51.507999420166016 -0.10999999940395355
>>>                       70.8018078804016'km') .
>>>   (?ent ?score) text:query ('environment' 'lang:en') .FILTER EXISTS
>>>   {?ent rdf:type iotic:Entity} .
>>>
>>>   OPTIONAL {
>>>     ?ent rdfs:label ?entLabel .
>>>     FILTER langMatches( lang(?entLabel), 'en' ) .
>>>   }
>>>
>>>   OPTIONAL {?ent geo:lat ?lat . ?ent geo:long ?long}
>>>   ?ent iotic:Advertises ?point .
>>>   ?point rdf:type iotic:Point .
>>>   ?point iotic:PointType ?pointType .
>>>
>>>   OPTIONAL {
>>>     ?point rdfs:label ?pointLabel .
>>>     FILTER langMatches( lang(?pointLabel), 'en' ) .
>>>   }
>>>
>>> }
>>>
>>> 3) Separately
>>> ==============
>>> Completely on their own:
>>> ========================
>>> i.e. just the ?ent spatial:nearby line
>>> the spatial query on its own takes 50 ms
>>> i.e. just the text:query line
>>> and the text on its own takes 258 ms
>>>
>>> With the OPTIONAL {} and other properties
>>> =========================================
>>> Spatial and other properties 135 ms
>>> Text and other properties 854 ms
>>>
>>> Again, repeated thanks for your help.
>>>
>>> Mark
>>>
>>> Technology Lead, Iotic Labs
>>> [email protected]
>>> https://www.iotic-labs.com
>>>
>>> On 22/12/15 17:22, Andy Seaborne wrote:
>>>> Mark,
>>>>
>>>> Thanks for the experiment results.
>>>>
>>>> On 22/12/15 15:47, Mark Wharton wrote:
>>>>> Query below run without Andy's switches.
>>>>> INFO [5] 200 OK (4.985 s)
>>>>>
>>>>> Query below run with Andy's switches.
>>>>> INFO [1] 200 OK (840 ms)
>>>>>
>>>>> Them's some magic switches. Thanks, Andy.
>>>>>
>>>>> Do they have any impact (negative or positive) on any other SPARQL
>>>>> operations? I'm only curious as you've solved our main problem in that
>>>>> our "search" query was very slow. There's nowhere else that uses the
>>>>> text and spatial indexes for retrieval.
>>>>
>>>> This depends on any internal change in the latest release (Jena 3.0.1,
>>>> Fuseki 2.3.1). Prior to that it will not make the same difference.
>>>> Specially, unoptimized joins are now hash joins.
>>>>
>>>> But that is not a good choice for the "?ent rdf:type iotic:Entity"
>>>> triple pattern.
>>>> The system can't distinguish different cases involving
>>>> external indexes as it knows not very much about the index details.
>>>>
>>>> Adding
>>>>
>>>> FILTER EXISTS { ?ent rdf:type iotic:Entity }
>>>>
>>>> might work because the triple pattern is really a check, not a match
>>>> setting a variable.
>>>>
>>>> A plain "?ent rdf:type iotic:Entity" will retrieve all things of that
>>>> class regardless of spatial and text query when those optimizations
>>>> are off.
>>>>
>>>> Andy
>>>>
>>>>>
>>>>> Many thanks for this help so close to the holiday season. Happy
>>>>> holidays to you all at Jena - keep up the good work.
>>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>> Technology Lead, Iotic Labs
>>>>> +44 7973 674404
>>>>> [email protected]
>>>>> https://www.iotic-labs.com
>>>>>
>>>>> On 22/12/15 11:49, Andy Seaborne wrote:
>>>>>> Mark - here is another way.
>>>>>>
>>>>>> This query:
>>>>>>
>>>>>> SELECT ?score ?ent
>>>>>> WHERE {
>>>>>>   { ?ent spatial:nearby ( .... ) }
>>>>>>   { ?ent text:query ( ..... ) }
>>>>>>   # No ?ent rdf:type iotic:Entity .
>>>>>>   # This focuses the query on the presenting issue.
>>>>>> }
>>>>>>
>>>>>> and then run Fuseki with the following flags:
>>>>>>
>>>>>> --set arq:optIndexJoinStrategy=false --set arq:optMergeBGPs=false
>>>>>>
>>>>>> for however you are running the server.
>>>>>>
>>>>>> You need both --set
>>>>>>
>>>>>> The service script will not do this very easily - if environment
>>>>>> variable FUSEKI_ARGS is set it might do. Untested.
>>>>>>
>>>>>> It is easier to run the server standalone:
>>>>>>
>>>>>> (Linux, Mac)
>>>>>>
>>>>>> The "fuseki-server" script should pass these in:
>>>>>>
>>>>>> fuseki-server \
>>>>>>   --set arq:optIndexJoinStrategy=false --set arq:optMergeBGPs=false \
>>>>>>   .. other args ..
>>>>>>
>>>>>> (Windows or any platform)
>>>>>>
>>>>>> You can call the server java code directly: all one line:
>>>>>>
>>>>>>
>>>>>> java -Xmx1200M -jar fuseki-server.jar --set arq:optIndexJoinStrategy=false --set arq:optMergeBGPs=false .. other args ..
>>>>>>
>>>>>> you'll need to put the full path name of fuseki-server.jar
>>>>>>
>>>>>> Sorry - I don't have your setup to test this fully. I did make sure that
>>>>>> the reworked query does lead to an execution plan that is different and
>>>>>> should yield some information about the situation.
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On 22/12/15 09:50, Andy Seaborne wrote:
>>>>>>> On 22/12/15 07:06, Mark Wharton wrote:
>>>>>>>> Ah, wheels within wheels.
>>>>>>>>
>>>>>>>> The formulation with the filter in it is fine, except that if you want
>>>>>>>> to search for more than one word or you match in label and comment then
>>>>>>>> the UNION formulation returns you duplicate rows. This isn't a problem
>>>>>>>> with the Lucene search which is why (I now remember) I used it in the
>>>>>>>> first place.
>>>>>>>>
>>>>>>>> I'm not sure what version of jena I'm using - I just use the fuseki
>>>>>>>> release at 2.3.0. Is there a way to find out?
>>>>>>>
>>>>>>> 3.0.0
>>>>>>>
>>>>>>> Many of the java commands support --version and the fuseki-server jar
>>>>>>> is an all-in-one jar:
>>>>>>>
>>>>>>> java -cp <YourInstall>/fuseki-server.jar arq.sparql --version
>>>>>>>
>>>>>>>> What's the status on the JENA-999 and JENA-1093 issues? I see there's
>>>>>>>> been some activity on 999 in the last few days. Andy Seaborne's last
>>>>>>>> comment seems encouraging.
>>>>>>>>
>>>>>>>> I don't want to adopt a single version as I'll be stuck forever patching
>>>>>>>> back and forward and it will break eventually.
>>>>>>>>
>>>>>>>> Many thanks for your continued help.
>>>>>>>
>>>>>>> JENA-999 may sort of help but I'm not that positive because each ?ent
>>>>>>> from the first part will be different going into the second part. It
>>>>>>> looks to me as if it is the overhead of going out to Lucene. (This is
>>>>>>> Lucene right? not Solr?)
>>>>>>>
>>>>>>> The ideal is some super compilation of the text:query and spatial query
>>>>>>> into one big Lucene query.
>>>>>>>
>>>>>>> What would also be good is to stop the general optimizer (this is
>>>>>>> nothing to do with TDB) using an index join. Except that is the better
>>>>>>> choice for the rdf:type. This is what the additional {} were trying for
>>>>>>> except the optimizer outsmarted
>>>>>>>
>>>>>>> SELECT ?score ?ent
>>>>>>> WHERE {
>>>>>>>   ?ent spatial:nearby( ...) .
>>>>>>>   (?ent ?score) text:query (...) .
>>>>>>>   ?ent rdf:type iotic:Entity .
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Mark - can you ask the query from Java? If so,
>>>>>>>
>>>>>>> Add "Optimize.noOptimizer(); " before executing the query. I can't see
>>>>>>> a way to do that from setting the environment for Fuseki.
>>>>>>>
>>>>>>> Or (the effect on time of this is version specific and whether it does
>>>>>>> anything useful is a big "maybe") you could try this:
>>>>>>>
>>>>>>> SELECT ?score ?ent
>>>>>>> WHERE {
>>>>>>>   { OPTIONAL { ?ent spatial:nearby "ABC" . }}
>>>>>>>   { OPTIONAL { ?ent text:query "DEF" } }
>>>>>>> }
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>
