Re: LARQ prefix search results missing hits

Osma Suominen Mon, 03 Sep 2012 03:50:04 -0700

Hi Paolo!

On 31.08.2012 21:58, Paolo Castagna wrote:

A. Traditional IR, in which the user cares about only the first few
results. Lucene is obviously very good at this, though full advantage
(especially for non-English languages) of it can only be achieved by
using specific Analyzer implementations, which appears not to be
supported in LARQ, at least not without writing some Java code.

B. Speeding up queries on literals for e.g. autocomplete search. While
this can be done without a text index using FILTER(REGEX()), the queries
tend to be quite slow, as the filter is applied only afterwards. In this
case it is important that the text index returns all possible hits, not
just the first ones.

[...]

Do you have any idea/proposal to make LARQ be good for both these
use cases?

For A, I think LARQ is quite good already, though I note that the
current implementation is hardcoded to use Lucene StandardAnalyzer, which
is pretty good for English text, fine for most European languages, but
maybe not that great for some other languages. Making it configurable to
support other Analyzers such as different language stemmers might be
useful. 4store allows a German stemmer to be used, for example [1].


For B, see below.

1. The index is global and contains data from all named graphs mixed up.
This means that when you have many named graphs with different data (as
I do), and try to query only one graph, the LARQ query part will still
return hits from all the other graphs, slowing down later parts of the
query.


Yep.

I thought about this a while ago, but I haven't actually tried to
implement it. The changes to the index are trivial. The most
difficult part is perhaps on the property function side, but
maybe that's easy as well.

I think this could be a good contribution, if you need it.

This would be good for my application as it would speed up queries,
sometimes by a lot I think. But I'm not that familiar with the Jena
codebase so I won't volunteer to implement it...

2. Similarly, the index does not allow filtering by language on the
query level. With multilingual data, you cannot make a query matching
e.g. only English labels but will get hits from all the other languages
as well.


Yep.

I have no proposal for this, but I understand the user need.

I tried a single-line change to LARQ.java to support querying by
language. Patch attached.

I tested this with the STW thesaurus dataset mentioned in the beginning
of this thread. This query against the current unpatched LARQ searches
for all concepts whose English language skos:prefLabel begins with A:


PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT DISTINCT * WHERE {
   ?lit pf:textMatch "a*" .
   ?conc skos:prefLabel ?lit .
   FILTER(REGEX(?lit, '^a.*', 'i') && langMatches(LANG(?lit), 'en'))
} ORDER BY ?lit

I benchmarked this query a few dozen times using apachebench. It takes
at minimum 35 ms on my machine.


With the patch applied, I can instead use this query:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT DISTINCT * WHERE {
   ?lit pf:textMatch "+a* +lang:en" .
   ?conc skos:prefLabel ?lit .
   FILTER(REGEX(?lit, '^a.*', 'i'))
} ORDER BY ?lit

Note that I no longer need to filter the results by language as the
index only provides hits with the correct language tag. This query now
takes 25 ms, so it's about 30% faster than the original. The Lucene index
size went from 4352 kb to 4444 kb, a 2% increase.

I admit this is quite a small dataset, but I haven't yet had time to
test with larger ones.


What do you think?

A possible refinement would be to support a syntax where the language
tag is taken from the literal in the query, e.g.

   ?lit pf:textMatch "a*"@en .
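
On the implementation side, such a literal could be translated into the
same combined query string used in the patched query above. A minimal
sketch, assuming a hypothetical helper called withLanguage (it is not an
existing LARQ function) and the "lang" field name from that query; it
only handles simple language tags such as "en":

```java
// Hypothetical helper, not part of LARQ: builds the Lucene query string
// accepted by the patched index from a text pattern and a language tag.
final class LangQuerySketch {
    static String withLanguage(String pattern, String lang) {
        String text = pattern.trim();
        if (lang == null || lang.isEmpty()) {
            return text; // plain literal: no language restriction
        }
        // Group multi-term patterns so the leading '+' covers all of them.
        if (text.contains(" ")) {
            text = "(" + text + ")";
        }
        return "+" + text + " +lang:" + lang;
    }

    public static void main(String[] args) {
        System.out.println(withLanguage("a*", "en")); // prints "+a* +lang:en"
    }
}
```

A plain literal without a language tag passes through unchanged, so
existing queries would keep their current behaviour.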

3. The default implementation also doesn't store much context for the
literal, meaning that you cannot restrict the search only to e.g.
skos:prefLabel literal values in skos:Concept type resources. This will
again increase the number of hits returned by the index internally.


I am not sure I follow this or I completely agree with you.

What you say is true, but LARQ provides a property function and you
can use it together with other triple patterns:

  {
    ?l pf:textMatch '...' .
    ?s skos:prefLabel ?l .
    ?s rdf:type skos:Concept .
  }

Now, we can argue on what a clever optimizer should/could do,
but from a point of view of the user, this is quite good and
powerful and it gets you what you want. Isn't it?

The syntax is very easy to remember and the property function
very easy to use.

The Lucene index can be kept quite simple and small.

You're right here, the syntax is perfectly fine. It is only an
optimization issue.

There may also be problems with prefix queries if you happen to hit the
default BooleanQuery limit of 1024 clauses, but I haven't yet had this
problem myself with LARQ. Another problem for use case B might be that
the default Lucene StandardAnalyzer, which LARQ seems to use, filters
common English stop words from the index and the query, which might
interfere with the exact matching required for B.


Yep.

Any ideas/proposals?

For the BooleanQuery issue, I would suggest adding this somewhere in the
LARQ code:

        BooleanQuery.setMaxClauseCount(newMax);

where newMax is a sufficiently large value (could be 100000 or
Integer.MAX_VALUE).
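
To illustrate why the limit matters for prefix searches, here is a
simplified model in plain Java. It is not Lucene code: it assumes a
prefix query is expanded into one clause per matching indexed term, and
whether the Lucene version used by LARQ actually rewrites prefix queries
this way is not verified here. The class and method names are made up,
and IllegalStateException stands in for Lucene's
BooleanQuery.TooManyClauses:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Simplified model, not Lucene code: a prefix query becomes one clause
// per matching indexed term, and expansion fails past a fixed cap.
final class PrefixExpansionSketch {
    static int maxClauseCount = 1024; // mirrors the default limit mentioned above

    static List<String> expand(SortedSet<String> terms, String prefix) {
        List<String> clauses = new ArrayList<>();
        // tailSet starts at the first term >= prefix; matches are contiguous.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            if (clauses.size() == maxClauseCount) {
                throw new IllegalStateException("too many clauses for prefix " + prefix);
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>();
        for (int i = 0; i < 2000; i++) {
            terms.add("a" + i); // 2000 distinct terms starting with "a"
        }
        try {
            expand(terms, "a");
        } catch (IllegalStateException e) {
            System.out.println("default limit exceeded");
        }
        maxClauseCount = 100000; // the kind of value suggested for newMax
        System.out.println(expand(terms, "a").size()); // prints 2000
    }
}
```

Under this model a short prefix on a large vocabulary is exactly the case
that hits the cap, which is why use case B is the one affected.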

For the other issues, I think use case B would benefit a lot if there
was a way to make the field "index" in the Lucene index use a simpler
Analyzer such as SimpleAnalyzer or TokenAnalyzer. Or alternatively,
perhaps the "lex" field could be processed with another analyzer. For my
application, something like LowerCaseKeywordAnalyzer would be perfect,
but it doesn't exist in the Lucene distribution. A quick web search
finds many such implementations though.
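
The difference between the two analysis styles can be shown without
Lucene at all. The sketch below uses plain-Java stand-ins, not Lucene
classes: wordTokens mimics word-based analysis with stop word removal
(with a small made-up stop list, not the real StandardAnalyzer one), and
keywordToken mimics what a LowerCaseKeywordAnalyzer would presumably
produce, i.e. the whole label as one lowercased token:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified stand-ins, not Lucene classes: they only mimic the effect
// of two analysis strategies on a label.
final class AnalysisSketch {
    // Small illustrative stop list, assumed for this example only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "of", "the", "to"));

    // Word-based analysis: lowercase, split on non-alphanumerics, drop stop words.
    static List<String> wordTokens(String label) {
        List<String> tokens = new ArrayList<>();
        for (String t : label.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Keyword-style analysis: the whole label is a single lowercased token.
    static List<String> keywordToken(String label) {
        return Collections.singletonList(label.toLowerCase(Locale.ROOT));
    }

    // True if a prefix query would match at least one of the tokens.
    static boolean anyStartsWith(List<String> tokens, String prefix) {
        for (String t : tokens) {
            if (t.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String label = "A priori knowledge";
        System.out.println(anyStartsWith(wordTokens(label), "a"));   // prints false
        System.out.println(anyStartsWith(keywordToken(label), "a")); // prints true
    }
}
```

With word-based analysis the leading "A" is dropped as a stop word, so
the prefix search for labels beginning with A misses this label, while
the keyword-style token still matches.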

(BTW, I don't quite understand why there are both "index" and "lex"
fields in the index. I think one field should be enough for both
retrieving exact strings and for performing text searches using
keywords.)


-Osma

[1] http://4store.org/trac/wiki/TextIndexing


--
Osma Suominen | [email protected] | +358 40 5255 882

Aalto University, Department of Media Technology, Semantic Computing
Research Group
Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076
Aalto, Finland

Index: LARQ.java
===================================================================
--- LARQ.java	(revision 1380183)
+++ LARQ.java	(working copy)
@@ -208,7 +208,7 @@
         
         if ( lang != null )
         {
-            f = new Field(LARQ.fLang, lang, Field.Store.YES, Field.Index.NO) ;
+            f = new Field(LARQ.fLang, lang, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS) ;
             doc.add(f) ;
         }
