Hi Osma.
Please don't apologise. It's nice to have someone to discuss this with
who is so far up the curve.
I was just about to rewrite the query exactly as you suggest (giving
in?), but you beat me to it.
SELECT ?score ?ent
WHERE {
  ?ent spatial:nearby(51.507999420166016 -0.10999999940395355
                      70.01807880401611 'km') .
  {
    { ?ent rdfs:label ?label .
      FILTER (regex(?label, "environment", "i") &&
              langmatches(lang(?label), "en"))
    }
    UNION
    { ?ent rdfs:comment ?comment .
      FILTER (regex(?comment, "environment", "i") &&
              langmatches(lang(?comment), "en")) }
  }
  ?ent rdf:type iotic:Entity .
}
gets the same results (sans ?score, but who cares?) and completes in
61ms! No contest.
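
The CONTAINS variant Osma mentions could presumably be written along the
lines below. This is an untested and untimed sketch, not something run
against this dataset. The LCASE() call is an assumption, added because
CONTAINS is case-sensitive whereas the regex above uses the "i" flag.
?score is dropped since nothing binds it, and prefix declarations are left
out, as in the other queries in this thread.

```sparql
SELECT ?ent
WHERE {
  # Spatial index lookup first: this is the selective part of the query
  ?ent spatial:nearby(51.507999420166016 -0.10999999940395355
                      70.01807880401611 'km') .
  {
    { ?ent rdfs:label ?label .
      # Lower-case the literal so the substring match ignores case
      FILTER (CONTAINS(LCASE(?label), "environment") &&
              langMatches(lang(?label), "en"))
    }
    UNION
    { ?ent rdfs:comment ?comment .
      FILTER (CONTAINS(LCASE(?comment), "environment") &&
              langMatches(lang(?comment), "en"))
    }
  }
  ?ent rdf:type iotic:Entity .
}
```

Whether this is any faster than the regex version is an open question;
both filter only the entities already returned by the spatial predicate.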
Thanks for your help
Mark
Technology Lead, Iotic Labs
+44 7973 674404
[email protected]
https://www.iotic-labs.com
On 21/12/15 14:30, Osma Suominen wrote:
> Hi Mark!
>
> Thanks for trying my queries. I'm sorry (but not surprised!) to hear
> that they weren't any better than your original query.
>
> I think JENA-999 is really the key here - until it is implemented, I
> don't know any way of speeding this up. My (very fuzzy) understanding of
> ARQ is that it always performs joins with iterators, so there is no way
> to first retrieve both result sets (one from spatial, one from
> jena-text) and only then join them, unless you do the cross product and
> only limit with FILTER, which is even slower (options 2 and 3).
>
> By the way you didn't say which version of Jena you're using. Jena 3.0.1
> actually does include the initial JENA-999 patch which, despite its
> problems (related to retrieving score and especially the original
> literal - see JENA-1093), might actually work in your case. It was
> reverted only after the 3.0.1 release.
>
> Other than testing with 3.0.1, I'd suggest skipping jena-text for now
> and changing the text query part to a FILTER expression instead, with a
> REGEX or CONTAINS. Applying that kind of filter to 450 items shouldn't
> add much to the execution time of the spatial query.
>
> -Osma
>
> [1] https://issues.apache.org/jira/browse/JENA-999
> [2] https://issues.apache.org/jira/browse/JENA-1093
>
> On 21/12/15 13:02, Mark Wharton wrote:
>> Hi Osma.
>>
>> Thanks for your help. It was exactly the kind of help that I wanted.
>>
>> Haha, but life is rarely that simple...
>>
>> I ran all your versions and my original a couple of times each to see if
>> there was any difference in performance.
>>
>> query  time1 (s)  time2 (s)
>> Orig   4.615      4.632
>> 1      4.648      4.608
>> 2      6.385      6.401
>> 3      6.353      6.442
>>
>> I think that proves that my "naive" version is working how I thought
>> (and the same as your version 1), i.e. it gets all the results from the
>> spatial predicate and then individually checks them against the text
>> predicate. (Which is why it's slower if you swap the predicates over,
>> as there are more matches on the text.)
>>
>> This kind of query can't be that unusual, surely? Things with "foo" in
>> their text within 50 km of point "bar"?
>>
>> Mark
>>
>>
>> Technology Lead, Iotic Labs
>> +44 7973 674404
>> [email protected]
>> https://www.iotic-labs.com
>>
>> On 21/12/15 07:53, Osma Suominen wrote:
>>> Hi Mark!
>>>
>>> I'm not sure that the jena-external-index approach would help. It might
>>> or might not, depending on how it's implemented. AFAIK it's just an idea
>>> right now, I haven't seen any code.
>>>
>>> In any case I think the problem with jena-text and probably jena-spatial
>>> too (not very familiar with it) is that they are pretty fast when you
>>> can get by doing just a single query with an unbound subject. But when
>>> you instead have a fixed subject, or more likely a list of possible
>>> subjects, the performance will be very bad because every subject will be
>>> queried separately from the Lucene index. This is probably what happens
>>> with your query - the results of one query will be fed to the other.
>>> JENA-999 tried to address some of this by introducing a cache into
>>> jena-text, but the patch that was committed had other problems (it
>>> returned the wrong results in some cases) so it was reverted recently
>>> and an improved version hasn't yet been developed.
>>>
>>> In any case, you could try these variants of the original query:
>>> (I'm really on thin ground here, but I spent a couple of days last week
>>> trying to optimize a somewhat similar query involving jena-text and
>>> various other conditions, trying to find a solution with good
>>> performance)
>>>
>>> 1. Try to isolate the text and spatial queries with extra braces
>>>
>>> SELECT ?score ?ent
>>> WHERE {
>>>   { ?ent spatial:nearby(51.507999420166016 -0.10999999940395355
>>>                         70.01807880401611 'km') }
>>>   { (?ent ?score) text:query ('environment' 'lang:en') }
>>>   ?ent rdf:type iotic:Entity .
>>> }
>>>
>>> 2. Use a FILTER expression to delay matching the results
>>>
>>> SELECT ?score ?ent
>>> WHERE {
>>>   ?ent spatial:nearby(51.507999420166016 -0.10999999940395355
>>>                       70.01807880401611 'km') .
>>>   (?ent2 ?score) text:query ('environment' 'lang:en') .
>>>   FILTER (?ent = ?ent2)
>>>   ?ent rdf:type iotic:Entity .
>>> }
>>>
>>> 3. Combination of above ideas
>>>
>>> SELECT ?score ?ent
>>> WHERE {
>>>   { ?ent spatial:nearby(51.507999420166016 -0.10999999940395355
>>>                         70.01807880401611 'km') }
>>>   { (?ent2 ?score) text:query ('environment' 'lang:en') }
>>>   FILTER (?ent = ?ent2)
>>>   ?ent rdf:type iotic:Entity .
>>> }
>>>
>>>
>>> -Osma
>>>
>>>
>>> On 21/12/15 00:28, Mark Wharton wrote:
>>>> Hi Marco.
>>>>
>>>> Yes, that's it. The indexes work well in isolation, but don't combine
>>>> well. Smooshing them into a single index would be a great idea,
>>>> especially if the query could resolve both text and spatial predicates
>>>> with one matching scan of the index.
>>>>
>>>> Perhaps Stephen could be persuaded to pick up the pace on this one?
>>>>
>>>> Thanks Mark
>>>>
>>>> On 20 December 2015 12:53:39 GMT+00:00, Marco Neumann
>>>> <[email protected]> wrote:
>>>>> Yes, correct, Mark: I am only referring to the extra payload here for
>>>>> invoking the spatial filter in the SPARQL query.
>>>>>
>>>>> Now that you mention a particular issue with the combined use of both
>>>>> jena-text and jena-spatial (something I am not aware of), this might
>>>>> be related to duplicated code in the two projects. Back in May, Stephen
>>>>> Allen wrote on the dev list that he was about to address some of this,
>>>>> possibly in a new jena-external-index project.
>>>>>
>>>>> http://mail-archives.apache.org/mod_mbox/jena-dev/201505.mbox/%3ccaptxtvpwu2ijogyj0kx8o6-07yokk5g1t32b_k3g_cjaqvk...@mail.gmail.com%3E
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Dec 20, 2015 at 2:29 AM, Mark Wharton
>>>>> <[email protected]> wrote:
>>>>>> Hi
>>>>>>
>>>>>> Thanks for this. I've read the chapter in the book and now I'm not sure
>>>>>> if I misunderstand your reply or you've only addressed half of the
>>>>>> problem.
>>>>>>
>>>>>> I'm not worried about the performance of the spatial search in
>>>>>> isolation - that's 97ms, which is fine. The text search on its own
>>>>>> takes a bit longer but that's acceptable, too.
>>>>>>
>>>>>> It's when I put the spatial and text *together* that the query time
>>>>>> increases by 10-30 times. That's the bit I don't understand and would
>>>>>> like some help with.
>>>>>>
>>>>>> Is there a SPARQL query formulation that can "AND the indexes" rather
>>>>>> than retrieving one set and looping through to retrieve the matches
>>>>>> individually on the other (which is my guess as to how it works)?
>>>>>>
>>>>>> Thanks for your help so far.
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>> Technology Lead, Iotic Labs
>>>>>> [email protected]
>>>>>> https://www.iotic-labs.com
>>>>>>
>>>>>> On 18/12/15 18:59, Marco Neumann wrote:
>>>>>>> It's a common spatial access method latency, in particular for small
>>>>>>> data sets. You can try an MBR range query instead.
>>>>>>>
>>>>>>> See Chapter 13, "Managing Space and Time", in Semantic Web Programming
>>>>>>> by John Hebeler et al., 2009.
>>>>>>>
>>>>>>> On Fri, Dec 18, 2015 at 10:13 AM, Mark Wharton
>>>>>>> <[email protected]> wrote:
>>>>>>>> Hi Jena users.
>>>>>>>>
>>>>>>>> I'm having performance problems with a query that uses text and
>>>>>>>> location search.
>>>>>>>>
>>>>>>>> The query is roughly this:
>>>>>>>>
>>>>>>>>
>>>>>>>> SELECT ?score ?ent
>>>>>>>> WHERE {
>>>>>>>>   ?ent spatial:nearby(51.507999420166016 -0.10999999940395355
>>>>>>>>                       70.01807880401611 'km') .
>>>>>>>>   (?ent ?score) text:query ('environment' 'lang:en') .
>>>>>>>>   ?ent rdf:type iotic:Entity .
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> There are about 450 entities in that radius.
>>>>>>>> There are about 2200 entities with "environment" in their
>>>>>>>> rdfs:comment.
>>>>>>>>
>>>>>>>> The query takes 5 seconds.
>>>>>>>>
>>>>>>>> I've tried this:
>>>>>>>> Commenting out the text predicate, the query takes 97 ms.
>>>>>>>> Commenting out the spatial predicate, the query takes 438 ms.
>>>>>>>> Swapping the spatial and text predicates, it takes 15 seconds.
>>>>>>>>
>>>>>>>>
>>>>>>>> My question is this... It looks like the query is separately getting
>>>>>>>> the results of the first two predicates and merging (somehow) to find
>>>>>>>> the intersection. Is there a formulation which will intersect the two
>>>>>>>> sets faster?
>>>>>>>>
>>>>>>>> Many TIAs,
>>>>>>>>
>>>>>>>> Mark
>>>>>>>> --
>>>>>>>> Technology Lead, Iotic Labs
>>>>>>>> [email protected]
>>>>>>>> https://www.iotic-labs.com
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>> ---
>>>>> Marco Neumann
>>>>> KONA
>>>>
>>>
>>>
>
>