On 11/11/15 04:40, Kamble, Ajay, Crest wrote:
Thank you Andy for replying.
1. I have a mix of constrained and free text queries. My constrained queries
(i.e. normal SPARQL queries without free text) took 3-10 seconds. Free text
queries took around 1 second.
Do you mean that the volume of the Lucene index will affect constrained
queries as well?
At this point I had included just a few concepts in the Lucene index. Here is
my configuration:
<#entMap> a text:EntityMap ;
text:entityField "uri" ;
text:defaultField "text" ;
text:map ( [ text:field "text" ; text:predicate no:concept1 ]
concept1, like the later ones, is a class, not a property.
If this is an anonymized setup+query, it's not helping in answering the
question.
[ text:field "text" ; text:predicate no:concept2 ]
[ text:field "text" ; text:predicate no:concept3 ]
[ text:field "text" ; text:predicate no:concept4 ]
[ text:field "text" ; text:predicate no:concept5 ]
[ text:field "text" ; text:predicate no:concept6 ] ) .
That uses the same Lucene field for each predicate - I'm not sure what
will happen. At best, it puts all the indexed text in one field, so Lucene
has to process all of it for any lookup.
2. Here is a sample query which takes 10+ seconds to execute. Is there anything
wrong with this query (or any possibility of optimization)?
The Lucene index and regex are unconnected.
The Lucene index is accessed with a property function "text:query"
http://jena.apache.org/documentation/query/text-query.html
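To show what "connected" would look like - a minimal sketch of the text:query property function, assuming ex:concept3 is mapped to a Lucene field in the entity map (names taken from the query below):

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX ex:   <http://example.com/ns/concepts#>

SELECT ?s ?label
WHERE {
  # Lucene lookup: bind ?s to subjects whose indexed ex:concept3
  # literal matches the query string "word1*" (prefix match).
  ?s text:query (ex:concept3 "word1*") ;
     ex:concept3 ?label .
}
```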
PREFIX ex:<http://example.com/ns/concepts#>
PREFIX d:<http://example.com/ns/data#>
SELECT DISTINCT ?a1
DISTINCT can hide a lot of work being done to find many, but few unique,
results.
WHERE {
?n1 a ex:concept1 ;
ex:concept2 ?c1 ;
concept as type and concept as property - looks odd to me.
ex:concept3 ?n2 ;
ex:concept4 ?f1 ;
ex:concept5 ?a1 .
?c1 ex:concept6 ?cn1 .
?f1 ex:concept7 ?fn1 .
Depending on the overall shape of your data, this is huge. It does not
start anywhere specific, so it might well scan a lot of the database.
What's more, multiple occurrences of properties on the same subject will
lead to fan-out, causing duplication of ?a1, which is then hidden by the
DISTINCT.
FILTER (regex(?n2, "^word1", "i"))
FILTER (regex(?cn1, "^word2$", "i"))
FILTER (regex(?fn1, "^word3$", "i")) }
The way this query will execute is that the pattern part is evaluated,
probably generating a lot of matches with a lot of duplication of ?a1, and
the filters are then used to test the results. Filters are pushed to the
best place, but there is only so much they can do.
Better might be:
(after sorting out the reuse of one field in the Lucene index)
# Look for all ?n2 of interest by concept2 in Lucene:
?n2 text:query (ex:concept2 "word1") .
# Then do pattern matching only for those ?n2
?n1 ex:concept3 ?n2 ;
ex:concept2 ?c1 ;
ex:concept4 ?f1 ;
ex:concept5 ?a1 .
?c1 ex:concept6 ?cn1 .
?f1 ex:concept7 ?fn1 .
# Checks
FILTER (regex(?cn1, "^word2$", "i"))
FILTER (regex(?fn1, "^word3$", "i")) }
You can start at word2 or word3 similarly - use the one with the fewest
likely matches.
You may need to keep the FILTERs if the way you get Lucene matches is
more general than the regex version (e.g. stemming matters).
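Assembling the fragment above into a complete query might look like the sketch below. The query string "word1*" uses a Lucene wildcard to approximate the original ^word1 prefix regex - an assumption about the analyzer in use - and the remaining FILTERs are kept as checks:

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX ex:   <http://example.com/ns/concepts#>
PREFIX d:    <http://example.com/ns/data#>

SELECT DISTINCT ?a1
WHERE {
  # Seed the pattern matching from the Lucene index:
  ?n2 text:query (ex:concept2 "word1*") .
  # Then pattern-match only for those ?n2:
  ?n1 ex:concept3 ?n2 ;
      ex:concept2 ?c1 ;
      ex:concept4 ?f1 ;
      ex:concept5 ?a1 .
  ?c1 ex:concept6 ?cn1 .
  ?f1 ex:concept7 ?fn1 .
  # Checks (keep these if Lucene matching is looser than the regexes):
  FILTER (regex(?cn1, "^word2$", "i"))
  FILTER (regex(?fn1, "^word3$", "i"))
}
```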
Andy
3. About Hardware, right now I am just running this on my MacBook Pro with 2.5
GHz Intel Core i7 and 16 GB of RAM.
It would be great if you could give me some suggestions or point me to any
resource that explains Fuseki optimization.