That’s a difficult question to answer because it all depends upon your data and
what you consider an acceptable level of performance.
Generally speaking, if you find yourself matching a very general pattern and then
filtering with a string function, you may be better served by text indexing, e.g.:
    SELECT *
    WHERE
    {
      ?s ?p ?o . # Scan all the data
      FILTER(STRSTARTS(?o, "foo"))
    }
However, if your query first reduces the set of data over which the filter must
apply by matching a more specific pattern, then string functions may be fine, e.g.:
    SELECT *
    WHERE
    {
      ?s a <urn:some-type> ;
         <urn:some-predicate> ?value . # Find some specific subset of the data
      FILTER(STRSTARTS(?value, "foo"))
    }
But it very much depends on the details, and generally it will be best to
benchmark your specific use case on your data and then judge for yourself. If, as
you imply, you are creating an application which hides the details of SPARQL
from the user, you are free to adjust the underlying queries as you see fit.
Rob
On 23/05/2017 08:39, "Laura Morales" <[email protected]> wrote:
Oh, this is interesting. I thought that predicate values (rdfs:label in
this case) were already sorted and that using STRSTARTS() would be fast because
it could take advantage of binary search or something. I didn't expect that
this function would have to scan all the predicate values.
So in which scenarios are the SPARQL string functions acceptable to use (in terms
of "reasonable performance")?
Laura Morales wrote on 23.05.2017 at 10:23:
> Thank you for the answer. So let's say I want to search nodes in my graph
> by rdfs:label. Is this correct...
>
> 1) STRSTARTS(): fast by default because predicates are sorted. Only does
>    exact search.
> 2) STRSTARTS(LCASE(?label)): fast because predicates are sorted, but just
>    a little bit slower than 1) because it must LCASE() some strings
> 3) REGEX(): slow because it must go through all rdfs:labels (use
>    jena-text instead)
> 4) CONTAINS(): slow because it must go through all rdfs:labels (use
>    jena-text instead)
>
> Is this correct?
I believe all of these are roughly equivalent in terms of performance.
All of them need to scan all the rdfs:label values. Obviously REGEX is a
bit more expensive than e.g. STRSTARTS but the difference is not very
big. I don't think there's any sorting of predicate values in TDB that
would help here.
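For illustration, option 2) written out as a full query might look like the sketch below. The search term "foo" is only a placeholder. The triple pattern still has to visit every rdfs:label value before the filter can reject it, which is why it behaves much like the other three options:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?s ?label
WHERE
{
  # Matches every rdfs:label triple in the data
  ?s rdfs:label ?label .
  # The filter is evaluated once per label found above;
  # LCASE() makes the prefix comparison case-insensitive
  FILTER(STRSTARTS(LCASE(?label), "foo"))
}
```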
> If my app has an input search box where users can search an item by title
> (on a large graph), would it be a good idea to go with 2) or should I consider
> setting up a text-query index?
I recommend setting up a text index if you want to do partial matching
of labels from a large graph.
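As a rough sketch only: assuming the dataset has been configured with a jena-text index that covers rdfs:label (the index configuration is not shown here, and the search term "foo*" and the limit of 20 are placeholders), a label lookup could then go through the text:query property function instead of a FILTER:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX text: <http://jena.apache.org/text#>

SELECT ?s ?label
WHERE
{
  # Ask the text index for subjects whose rdfs:label matches the
  # Lucene query "foo*" (a prefix match), returning at most 20 hits
  ?s text:query (rdfs:label "foo*" 20) ;
     rdfs:label ?label .
}
```

Here the candidate subjects come from the index rather than from a scan over all labels, and the second pattern only fetches the label of each hit.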
-Osma
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi