Re: LARQ prefix search results missing hits

Osma Suominen Tue, 28 Aug 2012 06:22:45 -0700

Hi Paolo!

Thanks a lot for the fix! I have tested the latest snapshot and it now works as expected. At least until I add lots of new data and hit the new limit :)

You're of course right about the search use case. I think the problem here is that the LARQ index can be used for two very different use cases:

A. Traditional IR, in which the user cares about only the first few results. Lucene is obviously very good at this, though taking full advantage of it (especially for non-English languages) requires specific Analyzer implementations, which do not appear to be supported in LARQ, at least not without writing some Java code.

B. Speeding up queries on literals for e.g. autocomplete search. While this can be done without a text index using FILTER(REGEX()), the queries tend to be quite slow, as the filter is applied only afterwards. In this case it is important that the text index returns all possible hits, not just the first ones.
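
To illustrate why use case B needs all hits, here is a minimal stand-alone Java sketch. It is only a simulation and does not call Lucene or LARQ: the class name, the method names, the sample labels (apart from "Arab") and the cap of 3 are all assumptions made for illustration, and the real index ranks hits by score rather than by list order. It shows how capping the index hits before a later filter is applied can silently drop matches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Stand-alone simulation (hypothetical names); it does not use Lucene or LARQ.
final class TopNTruncationDemo {

    // Mimics an index lookup that returns at most maxHits labels starting with the prefix.
    static List<String> prefixSearch(List<String> index, String prefix, int maxHits) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String label : index) {
            if (hits.size() >= maxHits) {
                break; // the cap is reached; remaining matches are silently dropped
            }
            if (label.toLowerCase(Locale.ROOT).startsWith(p)) {
                hits.add(label);
            }
        }
        return hits;
    }

    // Mimics a later case-insensitive FILTER step applied to whatever the index returned.
    static List<String> filterByPrefix(List<String> hits, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> kept = new ArrayList<>();
        for (String label : hits) {
            if (label.toLowerCase(Locale.ROOT).startsWith(p)) {
                kept.add(label);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical sample labels; only "Arab" comes from the thread.
        List<String> index = Arrays.asList(
                "Abandonment", "Ability", "Absence", "Arab", "Arbitration");

        // Broad prefix "a" with a cap of 3: the cap is used up before "Arab" is reached.
        System.out.println(filterByPrefix(prefixSearch(index, "a", 3), "ar"));  // prints []

        // Narrow prefix "ar" with the same cap: both matches fit under the cap.
        System.out.println(filterByPrefix(prefixSearch(index, "ar", 3), "ar")); // prints [Arab, Arbitration]
    }
}
```

With a sufficiently large cap both calls return the same two labels, which is the behaviour an autocomplete query relies on.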

I have no idea which is the more important use case for LARQ, but I'm currently only interested in B because of the requirements of the application I'm building (ONKI Light, a SKOS vocabulary browser for SPARQL endpoints).

Currently the benefits of LARQ (at least for the out-of-the-box configuration for Fuseki+LARQ) for both A and B are somewhat diminished by these limitations:

1. The index is global and contains data from all named graphs mixed up. This means that when you have many named graphs with different data (as I do), and try to query only one graph, the LARQ query part will still return hits from all the other graphs, slowing down later parts of the query.

2. Similarly, the index does not allow filtering by language on the query level. With multilingual data, you cannot make a query matching e.g. only English labels but will get hits from all the other languages as well.

3. The default implementation also doesn't store much context for the literal, meaning that you cannot restrict the search only to e.g. skos:prefLabel literal values in skos:Concept type resources. This will again increase the number of hits returned by the index internally.

There may also be problems with prefix queries if you happen to hit the default BooleanQuery limit of 1024 clauses, but I haven't yet had this problem myself with LARQ. Another problem for use case B might be that the default Lucene StandardAnalyzer, which LARQ seems to use, filters common English stop words from the index and the query, which might interfere with the exact matching required for B.
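
The stop word concern can be sketched with a small stand-alone Java simulation. It does not use Lucene: the tokenizer, the stop word set (a small illustrative subset, not the analyzer's actual list), the class and method names and the label "The Hague" are all assumptions. The point is only that a label whose first word is dropped at indexing time can no longer be found by a prefix of that word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Stand-alone simulation (hypothetical names); it does not use Lucene's analyzers.
final class StopWordPrefixDemo {

    // Illustrative subset of common English stop words (an assumption, not the real list).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    // Lower-cases the label, splits it on non-alphanumeric characters and drops stop words.
    static List<String> indexTerms(String label) {
        List<String> terms = new ArrayList<>();
        for (String token : label.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // A prefix query matches if any indexed term of the label starts with the prefix.
    static boolean matchesPrefix(String label, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        for (String term : indexTerms(label)) {
            if (term.startsWith(p)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(indexTerms("The Hague"));           // prints [hague]
        System.out.println(matchesPrefix("The Hague", "the")); // prints false
        System.out.println(matchesPrefix("The Hague", "ha"));  // prints true
    }
}
```

In an autocomplete setting this would mean that a user typing "the" never sees the label, even though a FILTER(REGEX()) on the literal itself would match it.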

To be fair, other SPARQL text index implementations are not that good for prefix searches either. Virtuoso [1] requires prefixes of at least 4 characters to be specified (this can be changed by recompiling). AFAICT the 4store text index [2] doesn't support prefix queries at all, as the index structure requires whole words to be used (though possibly some creative use of subqueries and FILTER(REGEX()) could be used to still get some benefit of the index).


Osma

[1] http://docs.openlinksw.com/virtuoso/sparqlextensions.html#rdfsparqlrulefulltext

[2] http://4store.org/trac/wiki/TextIndexing

26.08.2012 22:49, Paolo Castagna wrote:

Hi Osma

On 20/08/12 11:10, Osma Suominen wrote:

Hi Paolo!

Thanks for your quick reply.

17.08.2012 20:16, Paolo Castagna wrote:

Does your problem go away without changing the code and using:
?lit pf:textMatch ( 'a*' 100000 )


I tested this but it didn't help. If I use a parameter less than 1000
then I get even fewer hits, but values above 1000 don't have any effect.


Right.

I think the problem is this line in IndexLARQ.java:

TopDocs topDocs = searcher.search(query, (Filter)null, LARQ.NUM_RESULTS ) ;

As you can see the parameter for maximum number of hits is taken
directly from the NUM_RESULTS constant. The value specified in the query
has no effect on this level.


Correct.

It's not a problem adding a couple of '0'...
However, I am thinking that this would just shift the problem, wouldn't it?


You're right, it would just shift the problem but a sufficiently large
value could be used that never caused problems in practice. Maybe you
could consider NUM_RESULTS = Integer.MAX_VALUE ? :)


A lot of search use cases drive a UI for people, and
often only the first few results are necessary.

Try to keep hitting 'next >>' on Google: how many results can you get?

;-)

Anyway, I increased the NUM_RESULTS constant.

Or maybe LARQ should use another variant of Lucene's
IndexSearcher.search(), one which takes a Collector object instead of
the integer n parameter. E.g. this:
http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/core/org/apache/lucene/search/IndexSearcher.html#search%28org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20org.apache.lucene.search.Collector%29


Yes. That would be the thing to use if we want to retrieve all the
results from Lucene.

More thinking is necessary here...

In the meantime, you can find a LARQ SNAPSHOT here:
https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-larq/1.0.1-SNAPSHOT/

Paolo



Thanks,
Osma

On 15/08/12 10:31, Osma Suominen wrote:

Hi Paolo!

Thanks for your reply and sorry for the delay.

I tested this again with today's svn snapshot and it's still a problem.

However, after digging a bit further I found this in
jena-larq/src/main/java/org/apache/jena/larq/LARQ.java:

--clip--
      // The number of results returned by default
      public static final int NUM_RESULTS             = 1000 ; // should we increase this? -- PC
--clip--

I changed NUM_RESULTS to 100000 (added two zeros), built and installed
my modified LARQ with mvn install (NB this required tweaking arq.ver
and tdb.ver in jena-larq/pom.xml to match the current svn versions),
rebuilt Fuseki and now the problem is gone!

I would suggest that this constant be increased to something larger
than 1000. Based on the code comment, you seem to have had similar
thoughts sometime in the past :)

Thanks,
Osma


15.07.2012 11:21, Paolo Castagna wrote:

Hi Osma,
first of all, thanks for sharing your experience and clearly describing
your problem.
Further comments inline.

On 13/07/12 14:13, Osma Suominen wrote:

Hello!

I'm trying to use a Fuseki SPARQL endpoint together with LARQ to
create a system for accessing SKOS thesauri. The user interface
includes an autocompletion widget. The idea is to use the LARQ index
to make fast prefix queries on the concept labels.

However, I've noticed that in some situations I get fewer results from
the index than what I'd expect. This seems to happen when the LARQ
part of the query internally produces many hits, such as when doing a
single character prefix query (e.g. ?lit pf:textMatch 'a*').

I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on 2012-07-10 and
LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the LARQ
dependency to pom.xml and running mvn package. Other than this issue,
Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux
12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the standard
Ubuntu packages.

Steps to repeat:

1. package Fuseki with LARQ, as described above

2. start Fuseki with the attached configuration file, i.e.
./fuseki-server --config=larq-config.ttl

3. I'm using the STW thesaurus as an easily accessible example data
set (though the problem was originally found with other data sets):
- download http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
- unzip so you have stw.rdf

4. load the thesaurus file into the endpoint:
./s-put http://localhost:3030/ds/data default stw.rdf

5. build the LARQ index, e.g. this way:
- kill Fuseki
- rm -r /tmp/lucene
- start Fuseki again, so the index will be built

6. Make SPARQL queries from the web interface at http://localhost:3030

First try this SPARQL query:

PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
PREFIX pf:<http://jena.hpl.hp.com/ARQ/property#>
SELECT DISTINCT * WHERE {
?lit pf:textMatch "ar*" .
?conc skos:prefLabel ?lit .
FILTER(REGEX(?lit, '^ar.*', 'i'))
} ORDER BY ?lit

I get 120 hits, including "Arab"@en.

Now try the same query, but change the pf:textMatch argument to "a*".
This way I get only 32 results, not including "Arab"@en, even though
the shorter prefix query should match a superset of what was matched
by the first query (the regex should still filter it down to the same
result set).

This issue is not just about single character prefix queries. With
enough data sets loaded into the same index, this happens with longer
prefix queries as well.

I think that the problem might be related to Lucene's default
limitation of a maximum of 1024 clauses in boolean queries (and thus
prefix query matches), as described in the Lucene FAQ:
http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F


Yes, I think your hypothesis might be correct (I've not verified it
yet).

In case this is the problem, is there any way to tell LARQ to use a
higher BooleanQuery.setMaxClauseCount() value so that this limit is
not triggered? I find it a bit disturbing that hits are silently being
lost. I couldn't see any special output on the Fuseki log.


Not sure about this.

Paolo


Am I doing something wrong? If this is a genuine problem in LARQ, I
can of course make a bug report.


Thanks and best regards,
Osma Suominen



--
Osma Suominen | [email protected] | +358 40 5255 882

Aalto University, Department of Media Technology, Semantic Computing Research Group
Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076 Aalto, Finland
