Re: Query with spatial and text searches.

Mark Wharton Mon, 21 Dec 2015 23:08:04 -0800

Ah, wheels within wheels.

The FILTER formulation is fine, except that if you search for more than
one word, or an entity matches in both label and comment, the UNION
formulation returns duplicate rows.  This isn't a problem with the
Lucene search, which is why (I now remember) I used it in the first
place.
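
For reference, a minimal sketch of one way the duplicates might be
collapsed.  It is untested and assumes only ?ent is needed (no ?score).
Both UNION branches bind the same variable, so a single FILTER covers
label and comment, and SELECT DISTINCT removes an entity that matches in
both.  The iotic: namespace is a placeholder, as the real one isn't given
in this thread; rdf:, rdfs: and spatial: are the standard ones.

```sparql
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX spatial: <http://jena.apache.org/spatial#>
PREFIX iotic:   <http://example.org/iotic#>   # placeholder namespace (assumption)

# DISTINCT collapses rows where the same ?ent matched via label AND comment.
SELECT DISTINCT ?ent
WHERE {
  ?ent spatial:nearby (51.507999420166016 -0.10999999940395355
                       70.01807880401611 'km') .
  ?ent rdf:type iotic:Entity .
  {
    { ?ent rdfs:label ?text }
    UNION
    { ?ent rdfs:comment ?text }
  }
  # One filter for both branches, since both bind ?text.
  FILTER (regex(?text, "environment", "i") &&
          langMatches(lang(?text), "en"))
}
```

Searching for more than one word would presumably mean extending the
regex pattern (an alternation for "any of", or one regex() per word
joined with && for "all of"); DISTINCT should still keep each entity to
a single row.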


I'm not sure which version of Jena I'm using - I just use the Fuseki
2.3.0 release.  Is there a way to find out?

What's the status on the JENA-999 and JENA-1093 issues?  I see there's
been some activity on 999 in the last few days. Andy Seaborne's last
comment seems encouraging.

I don't want to pin myself to a single version, as I'd be stuck forever
patching back and forth and it would break eventually.

Many thanks for your continued help.

Mark

Technology Lead, Iotic Labs
[email protected]
https://www.iotic-labs.com

On 21/12/15 16:04, Mark Wharton wrote:
> Hi Osma.
> 
> Please don't apologise.  It's nice to have someone to discuss this with
> who is so far up the curve.
> 
> I was just going to rewrite exactly how you suggest (giving in?) but you
> beat me to it.
> 
> SELECT ?score ?ent
> WHERE {
> ?ent spatial:nearby(51.507999420166016 -0.10999999940395355
> 70.01807880401611 'km') .
> {
>  { ?ent rdfs:label ?label .
>    FILTER (regex(?label, "environment", "i") &&
>            langmatches(lang(?label), "en"))
>  }
>   UNION
>  { ?ent rdfs:comment ?comment .
>    FILTER (regex(?comment, "environment", "i") &&
>            langmatches(lang(?comment), "en")) }
> }
> ?ent rdf:type iotic:Entity .
> }
> 
> gets the same results (sans ?score, but who cares?) and completes in
> 61ms!  No contest.
> 
> Thanks for your help
> 
> Mark
> 
> Technology Lead, Iotic Labs
> +44 7973 674404
> [email protected]
> https://www.iotic-labs.com
> 
> On 21/12/15 14:30, Osma Suominen wrote:
>> Hi Mark!
>>
>> Thanks for trying my queries. I'm sorry (but not surprised!) to hear
>> that they weren't any better than your original query.
>>
>> I think JENA-999 is really the key here - until it is implemented, I
>> don't know any way of speeding this up. My (very fuzzy) understanding of
>> ARQ is that it always performs joins with iterators, so there is no way
>> to first retrieve both result sets (one from spatial, one from
>> jena-text) and only then join them, unless you do the cross product and
>> only limit with FILTER, which is even slower (options 2 and 3).
>>
>> By the way you didn't say which version of Jena you're using. Jena 3.0.1
>> actually does include the initial JENA-999 patch which, despite its
>> problems (related to retrieving score and especially the original
>> literal - see JENA-1093), might actually work in your case. It was
>> reverted only after the 3.0.1 release.
>>
>> Other than testing with 3.0.1, I'd suggest skipping jena-text for now
>> and changing the text query part to a FILTER expression instead, with a
>> REGEX or CONTAINS. Applying that kind of filter to 450 items shouldn't
>> add much to the execution time of the spatial query.
>>
>> -Osma
>>
>> [1] https://issues.apache.org/jira/browse/JENA-999
>> [2] https://issues.apache.org/jira/browse/JENA-1093
>>
>> On 21/12/15 13:02, Mark Wharton wrote:
>>> Hi Osma.
>>>
>>> Thanks for your help.  It was exactly the kind of help that I wanted.
>>>
>>> Haha, but life is rarely that simple...
>>>
>>> I ran all your versions and my original a couple of times each to see if
>>> there was any difference in performance.
>>>
>>> query   time1 (s)   time2 (s)
>>> Orig    4.615       4.632
>>> 1       4.648       4.608
>>> 2       6.385       6.401
>>> 3       6.353       6.442
>>>
>>> I think that proves that my "naive" version works the way I thought
>>> (and the same as your version 1), i.e. it gets all the results from the
>>> spatial predicate and then checks them individually against the text
>>> predicate.  (Which is why it's slower if you swap the predicates over,
>>> as there are more matches on the text.)
>>>
>>> This kind of query can't be that unusual, surely?  Things with "foo" in
>>> their text within 50 km of point "bar"?
>>>
>>> Mark
>>>
>>>
>>> Technology Lead, Iotic Labs
>>> +44 7973 674404
>>> [email protected]
>>> https://www.iotic-labs.com
>>>
>>> On 21/12/15 07:53, Osma Suominen wrote:
>>>> Hi Mark!
>>>>
>>>> I'm not sure that the jena-external-index approach would help. It might
>>>> or might not, depending on how it's implemented. AFAIK it's just an idea
>>>> right now, I haven't seen any code.
>>>>
>>>> In any case I think the problem with jena-text and probably jena-spatial
>>>> too (not very familiar with it) is that they are pretty fast when you
>>>> can get by doing just a single query with an unbound subject. But when
>>>> you instead have a fixed subject, or more likely a list of possible
>>>> subjects, the performance will be very bad because every subject will be
>>>> queried separately from the Lucene index. This is probably what happens
>>>> with your query - the results of one query will be fed to the other.
>>>> JENA-999 tried to address some of this by introducing a cache into
>>>> jena-text, but the patch that was committed had other problems (it
>>>> returned the wrong results in some cases) so it was reverted recently
>>>> and an improved version hasn't yet been developed.
>>>>
>>>> In any case, you could try these variants of the original query:
>>>> (I'm really on thin ground here, but I spent a couple of days last week
>>>> trying to optimize a somewhat similar query involving jena-text and
>>>> various other conditions, trying to find a solution with good
>>>> performance)
>>>>
>>>> 1. Try to isolate the text and spatial queries with extra braces
>>>>
>>>> SELECT ?score ?ent
>>>> WHERE {
>>>>   { ?ent spatial:nearby(51.507999420166016 -0.10999999940395355
>>>>                      70.01807880401611 'km') }
>>>>   { (?ent ?score) text:query ('environment' 'lang:en') }
>>>>   ?ent rdf:type iotic:Entity .
>>>> }
>>>>
>>>> 2. Use a FILTER expression to delay matching the results
>>>>
>>>> SELECT ?score ?ent
>>>> WHERE {
>>>>   ?ent spatial:nearby(51.507999420166016 -0.10999999940395355
>>>>                      70.01807880401611 'km') .
>>>>   (?ent2 ?score) text:query ('environment' 'lang:en') .
>>>>   FILTER (?ent = ?ent2)
>>>>   ?ent rdf:type iotic:Entity .
>>>> }
>>>>
>>>> 3. Combination of above ideas
>>>>
>>>> SELECT ?score ?ent
>>>> WHERE {
>>>>   { ?ent spatial:nearby(51.507999420166016 -0.10999999940395355
>>>>                      70.01807880401611 'km') }
>>>>   { (?ent2 ?score) text:query ('environment' 'lang:en') }
>>>>   FILTER (?ent = ?ent2)
>>>>   ?ent rdf:type iotic:Entity .
>>>> }
>>>>
>>>>
>>>> -Osma
>>>>
>>>>
>>>> On 21/12/15 00:28, Mark Wharton wrote:
>>>>> Hi Marco.
>>>>>
>>>>> Yes, that's it. The indexes work well in isolation, but don't combine
>>>>> well. Smooshing them into a single index would be a great idea,
>>>>> especially if the query could resolve both text and spatial predicates
>>>>> with one matching scan of the index.
>>>>>
>>>>> Perhaps Stephen could be persuaded to pick up the pace on this one?
>>>>>
>>>>> Thanks Mark
>>>>>
>>>>> On 20 December 2015 12:53:39 GMT+00:00, Marco Neumann
>>>>> <[email protected]> wrote:
>>>>>> Yes, correct, Mark. I am only referring to the extra payload here for
>>>>>> invoking the spatial filter in the SPARQL query.
>>>>>>
>>>>>> Now that you mention a particular issue with the combined use of both
>>>>>> jena-text and jena-spatial (something I am not aware of), this might
>>>>>> be related to duplicated code in the two projects. Back in May Stephen
>>>>>> Allen wrote on the dev list that he was about to address some of this,
>>>>>> possibly in a new jena-external-index project.
>>>>>>
>>>>>> http://mail-archives.apache.org/mod_mbox/jena-dev/201505.mbox/%3ccaptxtvpwu2ijogyj0kx8o6-07yokk5g1t32b_k3g_cjaqvk...@mail.gmail.com%3E
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, Dec 20, 2015 at 2:29 AM, Mark Wharton
>>>>>> <[email protected]> wrote:
>>>>>>> Hi
>>>>>>>
>>>>>>> Thanks for this.  I've read the chapter in the book and now I'm not
>>>>>>> sure if I misunderstand your reply or you've only addressed half of
>>>>>>> the problem.
>>>>>>>
>>>>>>> I'm not worried about the performance of the spatial search in
>>>>>>> isolation - that's 97ms which is fine.  The text search on its own
>>>>>>> takes a bit longer but that's acceptable, too.
>>>>>>>
>>>>>>> It's when I put the spatial and text *together* that query time
>>>>>>> increases by 10-30 times.  That's the bit I don't understand and
>>>>>>> would like some help with.
>>>>>>>
>>>>>>> Is there a SPARQL query formulation that can "AND the indexes" rather
>>>>>>> than retrieving one set and looping through to retrieve the matches
>>>>>>> individually on the other?  (Which is my guess as to how it works.)
>>>>>>>
>>>>>>> Thanks for your help so far.
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>> Technology Lead, Iotic Labs
>>>>>>> [email protected]
>>>>>>> https://www.iotic-labs.com
>>>>>>>
>>>>>>> On 18/12/15 18:59, Marco Neumann wrote:
>>>>>>>> It's a common spatial access method latency, in particular for small
>>>>>>>> data sets. You can try an MBR range query instead.
>>>>>>>>
>>>>>>>> See Chapter 13, Managing Space and Time, in Semantic Web Programming
>>>>>>>> by John Hebeler et al., 2009.
>>>>>>>>
>>>>>>>> On Fri, Dec 18, 2015 at 10:13 AM, Mark Wharton
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> Hi Jena users.
>>>>>>>>>
>>>>>>>>> I'm having performance problems with a query that uses text and
>>>>>>>>> location search.
>>>>>>>>>
>>>>>>>>> The query is roughly this:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> SELECT ?score ?ent
>>>>>>>>> WHERE {
>>>>>>>>>    ?ent spatial:nearby(51.507999420166016 -0.10999999940395355
>>>>>>>>>                       70.01807880401611 'km') .
>>>>>>>>>    (?ent ?score) text:query ('environment' 'lang:en') .
>>>>>>>>>    ?ent rdf:type iotic:Entity .
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> There are about 450 entities in that radius.
>>>>>>>>> There are about 2200 entities with "environment" in their
>>>>>>>>> rdfs:comment.
>>>>>>>>>
>>>>>>>>> The query takes 5 seconds.
>>>>>>>>>
>>>>>>>>> I've tried this:
>>>>>>>>> Commenting out the text predicate the query takes 97 ms
>>>>>>>>> Commenting out the spatial predicate the query takes 438 ms
>>>>>>>>> Swapping the spatial and text predicates it takes 15 seconds
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> My question is this...  It looks like the query is separately
>>>>>>>>> getting the results of the first two predicates and merging
>>>>>>>>> (somehow) to find the intersection.  Is there a formulation which
>>>>>>>>> will intersect the two sets faster?
>>>>>>>>>
>>>>>>>>> Many TIAs,
>>>>>>>>>
>>>>>>>>> Mark
>>>>>>>>> -- 
>>>>>>>>> Technology Lead, Iotic Labs
>>>>>>>>> [email protected]
>>>>>>>>> https://www.iotic-labs.com
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>>
>>>>>>
>>>>>> ---
>>>>>> Marco Neumann
>>>>>> KONA
>>>>>
>>>>
>>>>
>>
>>
