Hi Paolo!
31.08.2012 21:58, Paolo Castagna wrote:
A. Traditional IR, in which the user cares about only the first few
results. Lucene is obviously very good at this, though taking full
advantage of it (especially for non-English languages) requires
specific Analyzer implementations, which appear not to be supported in
LARQ, at least not without writing some Java code.
B. Speeding up queries on literals for e.g. autocomplete search. While
this can be done without a text index using FILTER(REGEX()), the queries
tend to be quite slow, as the filter is applied only afterwards. In this
case it is important that the text index returns all possible hits, not
just the first ones.
[...]
Do you have any idea/proposal to make LARQ be good for both these
use cases?
For A, I think LARQ is quite good already, though I note that the
current implementation is hardcoded to use Lucene StandardAnalyzer which
is pretty good for English text, fine for most European languages, but
maybe not that great for some other languages. Making it configurable to
support other Analyzers such as different language stemmers might be
useful. 4store allows a German stemmer to be used, for example [1].
For B, see below.
1. The index is global and contains data from all named graphs mixed up.
This means that when you have many named graphs with different data (as
I do), and try to query only one graph, the LARQ query part will still
return hits from all the other graphs, slowing down later parts of the
query.
Yep.
I thought about this a while ago, but I haven't actually tried to
implement it. The changes to the index are trivial. The most difficult
part is perhaps on the property function side, but maybe that's easy
as well.
I think this could be a good contribution, if you need it.
This would be good for my application as it would speed up queries,
sometimes by a lot I think. But I'm not that familiar with the Jena
codebase so I won't volunteer to implement it...
2. Similarly, the index does not allow filtering by language on the
query level. With multilingual data, you cannot make a query matching
e.g. only English labels but will get hits from all the other languages
as well.
Yep.
I have no proposal for this, but I understand the user need.
I tried a single line change to LARQ.java to support querying by
language. Patch attached.
I tested this with the STW thesaurus dataset mentioned in the beginning
of this thread. This query against the current unpatched LARQ searches
for all concepts whose English language skos:prefLabel begins with A:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT DISTINCT * WHERE {
?lit pf:textMatch "a*" .
?conc skos:prefLabel ?lit .
FILTER(REGEX(?lit, '^a.*', 'i') && langMatches(LANG(?lit), 'en'))
} ORDER BY ?lit
I benchmarked this query a few dozen times using apachebench. It takes
at minimum 35 ms on my machine.
With the patch applied, I can instead use this query:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT DISTINCT * WHERE {
?lit pf:textMatch "+a* +lang:en" .
?conc skos:prefLabel ?lit .
FILTER(REGEX(?lit, '^a.*', 'i'))
} ORDER BY ?lit
Note that I no longer need to filter the results by language as the
index only provides hits with the correct language tag. This query now
takes 25 ms, so it's about 30% faster than the original. The Lucene index
size went from 4352 KB to 4444 KB, a 2% increase.
I admit this is quite a small dataset, but I haven't yet had time to
test with larger ones.
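As a rough illustration of what "+a* +lang:en" asks of the index, here is
a minimal sketch in plain Java. It does not use Lucene or LARQ: the Entry
class, the sample labels and the search method are hypothetical, and the
matching rules (prefix match on any whitespace-separated token, exact match
on the language tag) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Plain-Java simulation of the query "+a* +lang:en"; not Lucene or LARQ code.
final class LangFilterSketch {
    // One indexed literal: lexical form plus language tag (hypothetical model).
    static final class Entry {
        final String lex;
        final String lang;
        Entry(String lex, String lang) { this.lex = lex; this.lang = lang; }
    }

    // Both conditions are mandatory, mirroring the two "+" clauses.
    static List<String> search(List<Entry> index, String prefix, String lang) {
        List<String> hits = new ArrayList<>();
        for (Entry e : index) {
            boolean prefixMatch = false;
            // Assumption: the prefix may match any token, not only the first one.
            for (String token : e.lex.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (token.startsWith(prefix)) { prefixMatch = true; break; }
            }
            // Assumption: the language tag is compared verbatim (not analyzed).
            if (prefixMatch && lang.equals(e.lang)) hits.add(e.lex);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Entry> index = Arrays.asList(
            new Entry("Agriculture", "en"),
            new Entry("Arbeit", "de"),
            new Entry("Labour agreement", "en"));
        System.out.println(search(index, "a", "en"));
    }
}
```

In this sketch "Labour agreement" is still a hit because one of its tokens
starts with "a", which is why the FILTER(REGEX(?lit, '^a.*', 'i')) stays in
the patched query. Also note that, under the exact-match assumption, a tag
such as en-GB would not match lang:en, whereas langMatches would accept it.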
What do you think?
A possible refinement would be to support a syntax where the language
tag is taken from the literal in the query, e.g.
?lit pf:textMatch "a*"@en .
3. The default implementation also doesn't store much context for the
literal, meaning that you cannot restrict the search only to e.g.
skos:prefLabel literal values in skos:Concept type resources. This will
again increase the number of hits returned by the index internally.
I am not sure I follow this or I completely agree with you.
What you say is true, but LARQ provides a property function and you
can use it together with other triple patterns:
{
?l pf:textMatch '...' .
?s skos:prefLabel ?l .
?s rdf:type skos:Concept .
}
Now, we can argue about what a clever optimizer should/could do,
but from the user's point of view, this is quite good and
powerful and it gets you what you want, doesn't it?
The syntax is very easy to remember and the property function
very easy to use.
The Lucene index can be kept quite simple and small.
You're right here, the syntax is perfectly fine. It is only an
optimization issue.
There may also be problems with prefix queries if you happen to hit the
default BooleanQuery limit of 1024 clauses, but I haven't yet had this
problem myself with LARQ. Another problem for use case B might be that
the default Lucene StandardAnalyzer, which LARQ seems to use, filters
common English stop words from the index and the query, which might
interfere with the exact matching required for B.
Yep.
Any ideas/proposals?
For the BooleanQuery issue, I would suggest adding this somewhere in the
LARQ code:
BooleanQuery.setMaxClauseCount(newMax)
where newMax is a sufficiently large value (could be 100000 or
Integer.MAX_VALUE).
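To show why the limit matters for prefix searches, here is a minimal
sketch in plain Java. It is a simulation, not Lucene code, and it assumes
the prefix query is rewritten into one clause per distinct indexed term
that starts with the prefix. The class, the method names and the sample
terms are hypothetical.

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Plain-Java simulation of prefix expansion; it does not call Lucene.
final class PrefixExpansionSketch {
    // The default BooleanQuery clause limit mentioned in this thread.
    static final int DEFAULT_MAX_CLAUSES = 1024;

    // Counts the distinct terms that a prefix such as "a" would expand to.
    static int clausesFor(SortedSet<String> terms, String prefix) {
        // Terms starting with the prefix sort between prefix (inclusive)
        // and prefix + Character.MAX_VALUE (exclusive).
        return terms.subSet(prefix, prefix + Character.MAX_VALUE).size();
    }

    // True if the expansion would need more clauses than the limit allows.
    static boolean exceedsLimit(SortedSet<String> terms, String prefix, int maxClauses) {
        return clausesFor(terms, prefix) > maxClauses;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>();
        for (int i = 0; i < 2000; i++) {
            terms.add("a" + i); // hypothetical indexed terms a0 .. a1999
        }
        terms.add("banana");
        System.out.println(clausesFor(terms, "a"));
        System.out.println(exceedsLimit(terms, "a", DEFAULT_MAX_CLAUSES));
    }
}
```

With a short prefix and a large vocabulary the count grows quickly, which
is the case a larger newMax is meant to cover.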
For the other issues, I think use case B would benefit a lot if there
was a way to make the field "index" in the Lucene index use a simpler
Analyzer such as SimpleAnalyzer or TokenAnalyzer. Or alternatively,
perhaps the "lex" field could be processed with another analyzer. For my
application, something like LowerCaseKeywordAnalyzer would be perfect,
but it doesn't exist in the Lucene distribution. A quick web search
finds many such implementations though.
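The difference between the two analysis strategies can be sketched in
plain Java. This is a simulation only and does not use Lucene: the stop
word list is a small illustrative assumption (not Lucene's actual list),
and lowerCaseKeyword only mimics what a LowerCaseKeywordAnalyzer would
presumably emit, namely the whole literal as one lowercased token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java simulation of two analysis strategies; it does not use Lucene.
final class AnalyzerSketch {
    // Small illustrative stop word list (an assumption for this example).
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "of", "a", "an", "and"));

    // Roughly what a stop-word-filtering analyzer does:
    // lowercase, split on non-alphanumerics, drop stop words.
    static List<String> tokenizeWithStopWords(String literal) {
        List<String> tokens = new ArrayList<>();
        for (String t : literal.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Keyword-style analysis: the whole literal becomes one lowercased token.
    static String lowerCaseKeyword(String literal) {
        return literal.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String label = "The Art of War";
        System.out.println(tokenizeWithStopWords(label));
        System.out.println(lowerCaseKeyword(label).startsWith("the art"));
    }
}
```

With the first strategy the words "the" and "of" never reach the index, so
an autocomplete prefix such as "the art" has nothing to match against. With
the keyword-style token the prefix can be compared against the literal from
its first character, which is what use case B needs.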
(BTW, I don't quite understand why there's both "index" and "lex" fields
in the index, I think one field should be enough for both retrieving
exact strings and for performing text searches using keywords).
-Osma
[1] http://4store.org/trac/wiki/TextIndexing
--
Osma Suominen | +358 40 5255 882
Aalto University, Department of Media Technology, Semantic Computing
Research Group
Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076
Aalto, Finland
Index: LARQ.java
===================================================================
--- LARQ.java (revision 1380183)
+++ LARQ.java (working copy)
@@ -208,7 +208,7 @@
if ( lang != null )
{
- f = new Field(LARQ.fLang, lang, Field.Store.YES, Field.Index.NO) ;
+ f = new Field(LARQ.fLang, lang, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS) ;
doc.add(f) ;
}