Hello Andy,

I did not completely understand your feedback.

Is there any advantage to using different fields for different predicates, 
given that my intention is to find matches on any predicate? That is, if I 
search for "foo", I want matches from bio, qualification and everything else 
in a single query.

Here is my sample query,

PREFIX text: <http://jena.apache.org/text#>
PREFIX no: <http://nano.springer.com/ns/nanoobjects#>
PREFIX d: <http://nano.springer.com/ns/data#>

SELECT ?s ?score ?what
{
    (?s ?score) text:query 'gold nanoparticles' .
    ?s a ?what .
}

-Ajay


On Nov 12, 2015, at 12:26 AM, Andy Seaborne <[email protected]> wrote:

On 11/11/15 16:30, Kamble, Ajay, Crest wrote:
Thank you, Andy, for the reply.

1. Performance: I was able to solve it by ordering the triples correctly. I 
read a chapter on optimization in the 'Learning SPARQL' book. The problem in my 
query was that I started with a large set - for example, give me all things A, 
then their Bs, and then filter on B. The better option is: give me all things B 
that pass the filter, then their As. After this tuning, all queries now return 
in under 1 second, which is great.

2. I am trying to understand your feedback on the Lucene index. Apologies for 
not giving the actual code, but here is a better representation.

<#entMap> a text:EntityMap ;
  text:entityField "uri" ;
  text:defaultField "text" ;
  text:map (
    [ text:field "text" ; text:predicate no:name ]
    [ text:field "text" ; text:predicate no:address ]
    [ text:field "text" ; text:predicate no:bio ]
    [ text:field "text" ; text:predicate no:qualification ]
    [ text:field "text" ; text:predicate no:hobbies ]
  ) .

I want the ability to do a free-text search over all of the properties (name, 
address, bio, qualification, hobbies) in a single query. Given that, is there 
anything wrong with my configuration?

Have you considered having different fields for different predicates?

<#entMap> a text:EntityMap ;
text:entityField "uri" ;
text:defaultField "name" ;
text:map (
  [ text:field "name" ;    text:predicate no:name ]
  [ text:field "address" ; text:predicate no:address ]
  [ text:field "bio" ;     text:predicate no:bio ]
  [ text:field "qualification" ; text:predicate no:qualification ]
  [ text:field "hobbies" ; text:predicate no:hobbies ]
) .

Then you can search by predicate:

?uri text:query (no:address 'Road') .

As you have it, searching for "foo" returns multiple matches if "foo" appears 
in, say, both bio and qualification.
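
If matches from every property are still wanted in a single query with this 
per-field setup, one option (just a sketch, not tested against this data) is to 
UNION the per-field lookups and record which field matched:

SELECT ?s ?score ?field
{
  { (?s ?score) text:query (no:bio 'foo') .           BIND (no:bio AS ?field) }
  UNION
  { (?s ?score) text:query (no:qualification 'foo') . BIND (no:qualification AS ?field) }
  UNION
  { (?s ?score) text:query (no:hobbies 'foo') .       BIND (no:hobbies AS ?field) }
}

Another possibility, assuming the query string is handed straight to Lucene's 
query parser, is a single field-prefixed string such as 'bio:foo OR 
qualification:foo', but that would need checking against this configuration.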

Andy



-Ajay

On Nov 11, 2015, at 4:54 PM, Andy Seaborne <[email protected]> wrote:

On 11/11/15 04:40, Kamble, Ajay, Crest wrote:
Thank you, Andy, for replying.

1. I have a mix of constrained and free-text queries. My constrained queries 
(i.e. normal SPARQL queries without free text) took 3-10 seconds. Free-text 
queries took around 1 second.
    Do you mean that the volume of the Lucene index will affect constrained 
queries as well?
    At this point I had only included a few concepts in the Lucene index. Here 
is my configuration:

<#entMap> a text:EntityMap ;
  text:entityField "uri" ;
  text:defaultField "text" ;
  text:map (
    [ text:field "text" ; text:predicate no:concept1 ]

concept1 is used as a class later on, not as a property.

If this is an anonymized setup+query, it's not helping in answering the 
question.

 [ text:field "text" ; text:predicate no:concept2 ]
 [ text:field "text" ; text:predicate no:concept3 ]
 [ text:field "text" ; text:predicate no:concept4 ]
 [ text:field "text" ; text:predicate no:concept5 ]
 [ text:field "text" ; text:predicate no:concept6 ] ) .

That uses the same Lucene field for each predicate - I'm not sure what will 
happen.  At best, it puts all the indexed text in one field, so Lucene has to 
process all of it for any lookup.


2. Here is a sample query which takes 10+ seconds to execute. Is there anything 
wrong with this query (or any possibility of optimization)?

The Lucene index and regex are unconnected.
The Lucene index is accessed with a property function "text:query"
http://jena.apache.org/documentation/query/text-query.html
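
For example, instead of testing an already-bound value with regex, the index 
can supply the candidates (a sketch only; ?s is a placeholder, and "word1*" 
uses Lucene prefix syntax in place of the regex "^word1"):

  ?s text:query "word1*" .

text:query binds ?s to subjects whose indexed literal values match; because the 
configuration above puts every predicate in the one "text" field, the match may 
have come from any of them.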

PREFIX ex:<http://example.com/ns/concepts#>
PREFIX d:<http://example.com/ns/data#>

SELECT DISTINCT ?a1

DISTINCT can hide a lot of work being done to find many, but few unique, 
results.

WHERE {
 ?n1 a ex:concept1 ;
 ex:concept2 ?c1 ;

concept as type and concept as property - looks odd to me.

 ex:concept3 ?n2 ;
 ex:concept4 ?f1 ;
 ex:concept5 ?a1 .
 ?c1 ex:concept6 ?cn1 .
 ?f1 ex:concept7 ?fn1 .

Depending on the overall shape of your data, this is huge.  It does not start 
anywhere so it might well be a scan of a lot of the database.

What's more, multiple occurrences of properties on the same subject will lead 
to fan-out, causing duplication of ?a1 which is then hidden by the DISTINCT.

 FILTER (regex(?n2, "^word1", "i"))
 FILTER (regex(?cn1, "^word2$", "i"))
 FILTER (regex(?fn1, "^word3$", "i")) }

The way this query will execute is that the pattern part is evaluated, probably 
generating a lot of matches with a lot of duplication of ?a1, and the filters 
are then used to test the results. Filters are pushed to the best place but 
there is only so much they can do.

Better might be:
(after sorting out the reuse of one field in the lucene index)

 # Look for all ?n2 of interest by concept2 in Lucene:
 ?n2 text:query (ex:concept2 "word1") .

 # Then do pattern matching only for those ?n2
 ?n1 ex:concept3 ?n2 ;
     ex:concept2 ?c1 ;
     ex:concept4 ?f1 ;
     ex:concept5 ?a1 .
 ?c1 ex:concept6 ?cn1 .
 ?f1 ex:concept7 ?fn1 .
 # Checks
 FILTER (regex(?cn1, "^word2$", "i"))
 FILTER (regex(?fn1, "^word3$", "i")) }

You can start at word2 or word3 similarly - use the one likely to have the 
fewest matches.
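
For instance, starting from word2 could look like this (again only a sketch, 
assuming the concept6 literals get their own field; the FILTERs stay as checks 
since the Lucene matches may be looser than the anchored regexes):

 # Look for all ?c1 of interest by concept6 in Lucene:
 ?c1 text:query (ex:concept6 "word2") .
 ?c1 ex:concept6 ?cn1 .
 # Then do the pattern matching only for those ?c1
 ?n1 a ex:concept1 ;
     ex:concept2 ?c1 ;
     ex:concept3 ?n2 ;
     ex:concept4 ?f1 ;
     ex:concept5 ?a1 .
 ?f1 ex:concept7 ?fn1 .
 # Checks
 FILTER (regex(?cn1, "^word2$", "i"))
 FILTER (regex(?n2, "^word1", "i"))
 FILTER (regex(?fn1, "^word3$", "i"))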

You may need to keep the FILTERs if the way you get Lucene matches is more 
general than the regex version (e.g. stemming matters).

Andy


3. About hardware: right now I am just running this on my MacBook Pro with a 
2.5 GHz Intel Core i7 and 16 GB of RAM.

It would be great if you could give me some suggestions or point me to any 
resource that explains Fuseki optimization.
