Hi Paolo!
Thanks for your quick reply.
17.08.2012 20:16, Paolo Castagna wrote:
Does your problem go away without changing the code and using:
?lit pf:textMatch ( 'a*' 100000 )
I tested this, but it didn't help. If I use a limit parameter smaller than
1000, I get even fewer hits, but values above 1000 have no effect.
I think the problem is this line in IndexLARQ.java:
TopDocs topDocs = searcher.search(query, (Filter)null, LARQ.NUM_RESULTS ) ;
As you can see, the maximum number of hits is taken directly from the
NUM_RESULTS constant, so the limit given in the query has no effect at
this level.
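Just to illustrate, a limit-aware variant might look roughly like this (the
'limit' variable here is purely hypothetical; I haven't checked how the second
pf:textMatch argument is actually passed around inside LARQ):

// Hypothetical sketch only: 'limit' would carry the per-query cap given as
// the second argument of pf:textMatch, falling back to the constant otherwise.
int numHits = (limit > 0) ? limit : LARQ.NUM_RESULTS ;
TopDocs topDocs = searcher.search(query, (Filter)null, numHits) ;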
It's not a problem adding a couple of '0'...
However, I am thinking that this would just shift the problem, wouldn't it?
You're right, it would just shift the problem, but a sufficiently large
value could be chosen so that the limit is never hit in practice. Maybe you
could consider NUM_RESULTS = Integer.MAX_VALUE ? :)
Or maybe LARQ should use another variant of Lucene's
IndexSearcher.search(), one which takes a Collector object instead of
the integer n parameter. E.g. this:
http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/core/org/apache/lucene/search/IndexSearcher.html#search%28org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20org.apache.lucene.search.Collector%29
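For what it's worth, here is a rough sketch of what such a collector could
look like against the Lucene 3.x API (AllHitsCollector is a name I made up,
not existing LARQ code, and I haven't tried to wire this into IndexLARQ):

import java.io.IOException ;
import java.util.ArrayList ;
import java.util.List ;

import org.apache.lucene.document.Document ;
import org.apache.lucene.index.IndexReader ;
import org.apache.lucene.search.Collector ;
import org.apache.lucene.search.Filter ;
import org.apache.lucene.search.Scorer ;

// Collects the ids of *all* matching documents instead of keeping only the
// top LARQ.NUM_RESULTS hits.
class AllHitsCollector extends Collector {
    private final List<Integer> docIds = new ArrayList<Integer>() ;
    private int docBase = 0 ;

    @Override public void setScorer(Scorer scorer) throws IOException { /* scores not needed */ }

    @Override public void collect(int doc) throws IOException {
        docIds.add(docBase + doc) ;   // no upper bound on the number of hits
    }

    @Override public void setNextReader(IndexReader reader, int docBase) throws IOException {
        this.docBase = docBase ;      // doc ids are segment-relative
    }

    @Override public boolean acceptsDocsOutOfOrder() { return true ; }

    public List<Integer> getDocIds() { return docIds ; }
}

IndexLARQ.search() could then call it along these lines and map the collected
document ids back to hits the way it already does for TopDocs:

AllHitsCollector collector = new AllHitsCollector() ;
searcher.search(query, (Filter)null, collector) ;
for (int id : collector.getDocIds()) {
    Document doc = searcher.doc(id) ;
    // ... convert 'doc' into a LARQ hit as the existing code does
}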
Thanks,
Osma
On 15/08/12 10:31, Osma Suominen wrote:
Hi Paolo!
Thanks for your reply and sorry for the delay.
I tested this again with today's svn snapshot and it's still a problem.
However, after digging a bit further I found this in
jena-larq/src/main/java/org/apache/jena/larq/LARQ.java:
--clip--
// The number of results returned by default
public static final int NUM_RESULTS = 1000 ; // should we increase this? -- PC
--clip--
I changed NUM_RESULTS to 100000 (added two zeros), built and installed
my modified LARQ with mvn install (NB this required tweaking arq.ver
and tdb.ver in jena-larq/pom.xml to match the current svn versions),
rebuilt Fuseki and now the problem is gone!
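For the record, the whole change in LARQ.java was this one line:

public static final int NUM_RESULTS = 100000 ; // was 1000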
I would suggest that this constant be increased to something larger
than 1000. Based on the code comment, you seem to have had similar
thoughts sometime in the past :)
Thanks,
Osma
15.07.2012 11:21, Paolo Castagna wrote:
Hi Osma,
first of all, thanks for sharing your experience and clearly describing
your problem.
Further comments inline.
On 13/07/12 14:13, Osma Suominen wrote:
Hello!
I'm trying to use a Fuseki SPARQL endpoint together with LARQ to
create a system for accessing SKOS thesauri. The user interface
includes an autocompletion widget. The idea is to use the LARQ index
to make fast prefix queries on the concept labels.
However, I've noticed that in some situations I get fewer results from
the index than I'd expect. This seems to happen when the LARQ
part of the query internally produces many hits, such as when doing a
single character prefix query (e.g. ?lit pf:textMatch 'a*').
I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on 2012-07-10 and
LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the LARQ
dependency to pom.xml and running mvn package. Other than this issue,
Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux
12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the standard
Ubuntu packages.
Steps to repeat:
1. package Fuseki with LARQ, as described above
2. start Fuseki with the attached configuration file, i.e.
./fuseki-server --config=larq-config.ttl
3. I'm using the STW thesaurus as an easily accessible example data
set (though the problem was originally found with other data sets):
- download http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
- unzip so you have stw.rdf
4. load the thesaurus file into the endpoint:
./s-put http://localhost:3030/ds/data default stw.rdf
5. build the LARQ index, e.g. this way:
- kill Fuseki
- rm -r /tmp/lucene
- start Fuseki again, so the index will be built
6. Make SPARQL queries from the web interface at http://localhost:3030
First try this SPARQL query:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT DISTINCT * WHERE {
  ?lit pf:textMatch "ar*" .
  ?conc skos:prefLabel ?lit .
  FILTER(REGEX(?lit, '^ar.*', 'i'))
} ORDER BY ?lit
I get 120 hits, including "Arab"@en.
Now try the same query, but change the pf:textMatch argument to "a*".
This way I get only 32 results, not including "Arab"@en, even though
the shorter prefix query should match a superset of what was matched
by the first query (the regex should still filter it down to the same
result set).
This issue is not just about single character prefix queries. With
enough data sets loaded into the same index, this happens with longer
prefix queries as well.
I think that the problem might be related to Lucene's default
limitation of a maximum of 1024 clauses in boolean queries (and thus
prefix query matches), as described in the Lucene FAQ:
http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F
Yes, I think your hypothesis might be correct (I've not verified it
yet).
If this is the problem, is there any way to tell LARQ to use a higher
BooleanQuery.setMaxClauseCount() value so that the limit is not hit? I
find it a bit disturbing that hits are silently being lost; I couldn't
see anything about it in the Fuseki log.
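For illustration, I mean something along these lines somewhere during
LARQ/Fuseki startup (the value 100000 is arbitrary, and I don't know where
such a call would best live):

import org.apache.lucene.search.BooleanQuery ;

// Raise Lucene's global (static, JVM-wide) limit on boolean query clauses
// before prefix queries are rewritten into boolean queries.
BooleanQuery.setMaxClauseCount(100000) ;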
Not sure about this.
Paolo
Am I doing something wrong? If this is a genuine problem in LARQ, I
can of course make a bug report.
Thanks and best regards,
Osma Suominen
--
Osma Suominen | [email protected] | +358 40 5255 882
Aalto University, Department of Media Technology, Semantic Computing
Research Group
Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076
Aalto, Finland