Hi Paolo!

31.08.2012 21:58, Paolo Castagna kirjoitti:

A. Traditional IR, in which the user cares about only the first few
results. Lucene is obviously very good at this, though full advantage
(especially for non-English languages) of it can only be achieved by
using specific Analyzer implementations, which appears not to be
supported in LARQ, at least not without writing some Java code.

B. Speeding up queries on literals for e.g. autocomplete search. While
this can be done without a text index using FILTER(REGEX()), the queries
tend to be quite slow, as the filter is applied only afterwards. In this
case it is important that the text index returns all possible hits, not
just the first ones.
[...]
Do you have any idea/proposal to make LARQ be good for both these
use cases?

For A, I think LARQ is quite good already, though I note that the current implementation is hardcoded to use Lucene StandardAnalyzer which is pretty good for English text, fine for most European languages, but maybe not that great for some other languages. Making it configurable to support other Analyzers such as different language stemmers might be useful. 4store allows a German stemmer to be used, for example [1].

For B, see below.

1. The index is global and contains data from all named graphs mixed up.
This means that when you have many named graphs with different data (as
I do), and try to query only one graph, the LARQ query part will still
return hits from all the other graphs, slowing down later parts of the
query.

Yep.

I thought about this a while ago, but I haven't actually tried to implement
it. The changes to the index are trivial. The most
difficult part perhaps is on the property function side, but
maybe that's easy as well.

I think this could be a good contribution, if you need it.

This would be good for my application as it would speed up queries, sometimes by a lot I think. But I'm not that familiar with the Jena codebase so I won't volunteer to implement it...

2. Similarly, the index does not allow filtering by language on the
query level. With multilingual data, you cannot make a query matching
e.g. only English labels but will get hits from all the other languages
as well.

Yep.

I have no proposal for this, but I understand the user need.

I tried a single line change to LARQ.java to support querying by language. Patch attached.

I tested this with the STW thesaurus dataset mentioned in the beginning of this thread. This query against the current unpatched LARQ searches for all concepts whose English language skos:prefLabel begins with A:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT DISTINCT * WHERE {
   ?lit pf:textMatch "a*" .
   ?conc skos:prefLabel ?lit .
   FILTER(REGEX(?lit, '^a.*', 'i') && langMatches(LANG(?lit), 'en'))
} ORDER BY ?lit

I benchmarked this query a few dozen times using apachebench. It takes at minimum 35 ms on my machine.

With the patch applied, I can instead use this query:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT DISTINCT * WHERE {
   ?lit pf:textMatch "+a* +lang:en" .
   ?conc skos:prefLabel ?lit .
   FILTER(REGEX(?lit, '^a.*', 'i'))
} ORDER BY ?lit

Note that I no longer need to filter the results by language, as the index only returns hits with the correct language tag. This query now takes 25 ms, so it's about 30% faster than the original. The Lucene index size went from 4352 KB to 4444 KB, a 2% increase.

I admit this is a quite small dataset, but I haven't yet had time to test with larger ones.

What do you think?

A possible refinement would be to support a syntax where the language tag is taken from the literal in the query, e.g.
   ?lit pf:textMatch "a*"@en .
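
A rough sketch of how such a literal could be turned into the query string used above, assuming the patched index where the language tag is searchable in the "lang" field. The class and method names are hypothetical and not part of LARQ; the parentheses are only there so that a multi-word search string stays grouped.

```java
// Hypothetical helper, not part of LARQ: combines the lexical form and the
// language tag of a pf:textMatch literal such as "a*"@en into one
// Lucene query string.
final class LangQuerySketch {

    // "lang" is the field name used in the patched query above.
    static String toQueryString(String lexicalForm, String langTag) {
        if (langTag == null || langTag.isEmpty()) {
            // Plain literal: pass the search string through unchanged.
            return lexicalForm;
        }
        // Require both the text match and the language tag.
        return "+(" + lexicalForm + ") +lang:" + langTag;
    }

    public static void main(String[] args) {
        System.out.println(toQueryString("a*", "en"));
        System.out.println(toQueryString("a*", null));
    }
}
```

For a single-term search string, `+(a*) +lang:en` should behave the same as the `+a* +lang:en` form in the query above.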


3. The default implementation also doesn't store much context for the
literal, meaning that you cannot restrict the search only to e.g.
skos:prefLabel literal values in skos:Concept type resources. This will
again increase the number of hits returned by the index internally.

I am not sure I follow this or I completely agree with you.

What you say is true, but LARQ provides a property function and you
can use it together with other triple patterns:

  {
    ?l pf:textMatch '...' .
    ?s skos:prefLabel ?l .
    ?s rdf:type skos:Concept .
  }

Now, we can argue on what a clever optimizer should/could do,
but from a point of view of the user, this is quite good and
powerful and it gets you what you want. Isn't it?

The syntax is very easy to remember and the property function
very easy to use.

The Lucene index can be kept quite simple and small.

You're right here, the syntax is perfectly fine. It is only an optimization issue.

There may also be problems with prefix queries if you happen to hit the
default BooleanQuery limit of 1024 clauses, but I haven't yet had this
problem myself with LARQ. Another problem for use case B might be that
the default Lucene StandardAnalyzer, which LARQ seems to use, filters
common English stop words from the index and the query, which might
interfere with the exact matching required for B.

Yep.

Any ideas/proposals?

For the BooleanQuery issue, I would suggest adding this somewhere in the LARQ code:
        BooleanQuery.setMaxClauseCount(newMax)
where newMax is a sufficiently large value (could be 100000 or Integer.MAX_VALUE).
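
To illustrate why a short prefix can run into the limit: depending on the Lucene version and rewrite method, a prefix query may be expanded into one Boolean clause per matching indexed term. The following is a plain-Java simulation of that counting, not Lucene code; the class and method names are made up for this example.

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Plain-Java simulation, not Lucene code: counts how many indexed terms a
// prefix would expand to and compares that count with a clause limit.
final class PrefixExpansionSketch {

    static final int DEFAULT_MAX_CLAUSES = 1024; // the default limit mentioned above

    // Number of terms in the sorted vocabulary that start with the prefix.
    static int countExpansions(SortedSet<String> terms, String prefix) {
        // Terms starting with the prefix lie in [prefix, prefix + '\uffff')
        // (ignoring terms that continue with the character '\uffff' itself).
        return terms.subSet(prefix, prefix + Character.MAX_VALUE).size();
    }

    static boolean exceedsLimit(SortedSet<String> terms, String prefix, int maxClauses) {
        return countExpansions(terms, prefix) > maxClauses;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>();
        for (int i = 0; i < 2000; i++) {
            terms.add("a" + i); // 2000 distinct terms starting with "a"
        }
        terms.add("banana");
        System.out.println(countExpansions(terms, "a"));
        System.out.println(exceedsLimit(terms, "a", DEFAULT_MAX_CLAUSES));
    }
}
```

With a vocabulary like this, the prefix "a" would need 2000 clauses, which is why an autocomplete-style query on one or two letters is the case most likely to need a higher limit.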

For the other issues, I think use case B would benefit a lot if there was a way to make the field "index" in the Lucene index use a simpler Analyzer such as SimpleAnalyzer or TokenAnalyzer. Or alternatively, perhaps the "lex" field could be processed with another analyzer. For my application, something like LowerCaseKeywordAnalyzer would be perfect, but it doesn't exist in the Lucene distribution. A quick web search finds many such implementations though.
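
To make the difference concrete, here is a plain-Java sketch of the two behaviours. It does not use the Lucene API; the class, the method names and the tiny stop word list are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration, not Lucene code.
final class AnalyzerBehaviourSketch {

    // Tiny illustrative stop word list; not Lucene's actual list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "and"));

    // Roughly what a stop-word-filtering analyzer does: lowercase,
    // split on non-alphanumeric characters, drop stop words.
    static List<String> tokenizeWithStopWords(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // What a "lowercase keyword" analyzer would do: the whole literal
    // becomes a single lowercased token.
    static List<String> lowerCaseKeyword(String text) {
        return Collections.singletonList(text.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(tokenizeWithStopWords("Theory of the firm"));
        System.out.println(lowerCaseKeyword("Theory of the firm"));
    }
}
```

With the single-token form, a prefix such as "theory of" can still match the start of the literal, whereas in the tokenized form the words "of" and "the" are no longer in the index at all.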

(BTW, I don't quite understand why there's both "index" and "lex" fields in the index, I think one field should be enough for both retrieving exact strings and for performing text searches using keywords).

-Osma

[1] http://4store.org/trac/wiki/TextIndexing


--
Osma Suominen | [email protected] | +358 40 5255 882
Aalto University, Department of Media Technology, Semantic Computing Research Group Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076 Aalto, Finland
Index: LARQ.java
===================================================================
--- LARQ.java	(revision 1380183)
+++ LARQ.java	(working copy)
@@ -208,7 +208,7 @@
         
         if ( lang != null )
         {
-            f = new Field(LARQ.fLang, lang, Field.Store.YES, Field.Index.NO) ;
+            f = new Field(LARQ.fLang, lang, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS) ;
             doc.add(f) ;
         }
 
