On 11/11/15 04:40, Kamble, Ajay, Crest wrote:
Thank you Andy for replying.
1. I have a mix of constrained and free text queries. My constrained queries
(i.e. normal SPARQL queries without free text) took 3-10 seconds. Free text
queries took around 1 second.
Do you mean that the volume of the Lucene index will affect constrained
queries as well?
At this point I had included just a few concepts in the Lucene index. Here is
my configuration:
<#entMap> a text:EntityMap ;
text:entityField "uri" ;
text:defaultField "text" ;
text:map ( [ text:field "text" ; text:predicate no:concept1 ]
concept1, like the later ones, is a class, not a property.
If this is an anonymized setup+query, it's not helping in answering the
question.
[ text:field "text" ; text:predicate no:concept2 ]
[ text:field "text" ; text:predicate no:concept3 ]
[ text:field "text" ; text:predicate no:concept4 ]
[ text:field "text" ; text:predicate no:concept5 ]
[ text:field "text" ; text:predicate no:concept6 ] ) .
That uses the same Lucene field for each predicate - I'm not sure what
will happen. At best, it puts all the indexed text in one field, so Lucene
has to process all of it for any lookup.
2. Here is a sample query which takes 10+ seconds to execute. Is there anything
wrong with this query (or any possibility of optimization)?
The Lucene index and regex are unconnected.
The Lucene index is accessed with a property function "text:query"
http://jena.apache.org/documentation/query/text-query.html
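To show what "connected" would look like - a minimal sketch of the text:query property function, assuming ex:concept3 is mapped to a Lucene field in the entity map (names taken from the query below):

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX ex:   <http://example.com/ns/concepts#>

SELECT ?s ?label
WHERE {
  # Lucene lookup: bind ?s to subjects whose indexed ex:concept3
  # literal matches the query string "word1*" (prefix match).
  ?s text:query (ex:concept3 "word1*") ;
     ex:concept3 ?label .
}
```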
PREFIX ex:<http://example.com/ns/concepts#>
PREFIX d:<http://example.com/ns/data#>
SELECT DISTINCT ?a1
DISTINCT can hide a lot of work being done to find many, but few unique,
results.
WHERE {
?n1 a ex:concept1 ;
ex:concept2 ?c1 ;
concept as type and concept as property - looks odd to me.
ex:concept3 ?n2 ;
ex:concept4 ?f1 ;
ex:concept5 ?a1 .
?c1 ex:concept6 ?cn1 .
?f1 ex:concept7 ?fn1 .
Depending on the overall shape of your data, this is huge. It does not
start anywhere specific, so it might well scan a lot of the database.
What's more, multiple occurrences of properties on the same subject will
lead to fan-out, causing duplication of ?a1, which is then hidden by the
DISTINCT.
FILTER (regex(?n2, "^word1", "i"))
FILTER (regex(?cn1, "^word2$", "i"))
FILTER (regex(?fn1, "^word3$", "i")) }
The way this query will execute is that the pattern part is evaluated,
probably generating a lot of matches with a lot of duplication of ?a1, and
the filters are then used to test the results. Filters are pushed to the
best place, but there is only so much they can do.
Better might be:
(after sorting out the reuse of one field in the Lucene index)
# Look for all ?n2 of interest by concept2 in Lucene:
?n2 text:query (ex:concept2 "word1") .
# Then do pattern matching only for those ?n2
?n1 ex:concept3 ?n2 ;
ex:concept2 ?c1 ;
ex:concept4 ?f1 ;
ex:concept5 ?a1 .
?c1 ex:concept6 ?cn1 .
?f1 ex:concept7 ?fn1 .
# Checks
FILTER (regex(?cn1, "^word2$", "i"))
FILTER (regex(?fn1, "^word3$", "i")) }
You can start at word2 or word3 similarly - use the one with the fewest
likely matches.
You may need to keep the FILTERs if the way you get Lucene matches is
more general than the regex version (e.g. stemming matters).
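Assembling the fragment above into a complete query might look like the sketch below. The query string "word1*" uses a Lucene wildcard to approximate the original ^word1 prefix regex - an assumption about the analyzer in use - and the remaining FILTERs are kept as checks:

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX ex:   <http://example.com/ns/concepts#>
PREFIX d:    <http://example.com/ns/data#>

SELECT DISTINCT ?a1
WHERE {
  # Seed the pattern matching from the Lucene index:
  ?n2 text:query (ex:concept2 "word1*") .
  # Then pattern-match only for those ?n2:
  ?n1 ex:concept3 ?n2 ;
      ex:concept2 ?c1 ;
      ex:concept4 ?f1 ;
      ex:concept5 ?a1 .
  ?c1 ex:concept6 ?cn1 .
  ?f1 ex:concept7 ?fn1 .
  # Checks (keep these if Lucene matching is looser than the regexes):
  FILTER (regex(?cn1, "^word2$", "i"))
  FILTER (regex(?fn1, "^word3$", "i"))
}
```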
Andy
3. About Hardware, right now I am just running this on my MacBook Pro with 2.5
GHz Intel Core i7 and 16 GB of RAM.
It would be great if you could give me some suggestions or point me to any
resource that explains Fuseki optimization.