Oh, this is interesting. I thought that predicate values (rdfs:label in this case) were already sorted, and that STRSTARTS() would be fast because it could take advantage of binary search or something similar. I didn't expect that this function would have to scan all the predicate values. So in which scenarios are the SPARQL string functions acceptable to use (in terms of "reasonable performance")?
Laura Morales wrote on 23.05.2017 at 10:23:
> Thank you for the answer. So let's say I want to search nodes in my graph by
> rdfs:label. Is this correct...
>
> 1) STRSTARTS(): fast by default because predicates are sorted. Only does exact
> search.
> 2) STRSTARTS(LCASE(?label)): fast because predicates are sorted, but just a
> little bit slower than 1) because it must LCASE() some strings
> 3) REGEX(): slow because it must go through all rdfs:labels (use jena-text
> instead)
> 4) CONTAINS(): slow because it must go through all rdfs:labels (use jena-text
> instead)
>
> Is this correct?

I believe all of these are roughly equivalent in terms of performance. All of them need to scan all the rdfs:label values. Obviously REGEX is a bit more expensive than e.g. STRSTARTS, but the difference is not very big. I don't think there's any sorting of predicate values in TDB that would help here.

> If my app has an input search box where users can search an item by title (on
> a large graph), would it be a good idea to go with 2) or should I consider
> setting up a text-query index?

I recommend setting up a text index if you want to do partial matching of labels from a large graph.

-Osma

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi
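
For illustration, the two approaches discussed in this thread might look roughly like the queries below. These are sketches, not tested configurations: the search term "hel" and the result limit of 20 are made-up examples, and the second query assumes that a jena-text index has already been configured on the dataset with rdfs:label as an indexed property.

A FILTER-based, case-insensitive prefix match. Every rdfs:label value has to be fetched and tested:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Scans all rdfs:label values; the FILTER is evaluated once per label.
SELECT ?s ?label
WHERE {
  ?s rdfs:label ?label .
  FILTER(STRSTARTS(LCASE(STR(?label)), "hel"))
}
LIMIT 20
```

The same kind of search through a jena-text index, using the text:query property function:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX text: <http://jena.apache.org/text#>

# Looks up candidates in the text index first, then fetches their labels.
# Assumes rdfs:label is mapped to a field in the text index configuration.
SELECT ?s ?label
WHERE {
  ?s text:query (rdfs:label "hel*" 20) .
  ?s rdfs:label ?label .
}
```

Note that the two queries are not strictly equivalent. With a typical analyzer, the text index matches individual words inside a label, so "hel*" can also hit labels where the matching word is not the first one.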
