Re: LARQ prefix search results missing hits

Paolo Castagna Sun, 15 Jul 2012 01:22:19 -0700

Hi Osma,

first of all, thanks for sharing your experience and clearly describing your problem.

Further comments inline.


On 13/07/12 14:13, Osma Suominen wrote:

Hello!
I'm trying to use a Fuseki SPARQL endpoint together with LARQ to create a system for accessing SKOS thesauri. The user interface includes an autocompletion widget. The idea is to use the LARQ index to make fast prefix queries on the concept labels.
However, I've noticed that in some situations I get fewer results from the index than I'd expect. This seems to happen when the LARQ part of the query internally produces many hits, such as when doing a single character prefix query (e.g. ?lit pf:textMatch 'a*').
I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on 2012-07-10 and LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the LARQ dependency to pom.xml and running mvn package. Other than this issue, Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux 12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the standard Ubuntu packages.
Steps to repeat:

1. package Fuseki with LARQ, as described above

2. start Fuseki with the attached configuration file, i.e.
   ./fuseki-server --config=larq-config.ttl
3. I'm using the STW thesaurus as an easily accessible example dataset (though the problem was originally found with other data sets):
   - download http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
   - unzip so you have stw.rdf

4. load the thesaurus file into the endpoint:
   ./s-put http://localhost:3030/ds/data default stw.rdf

6. build the LARQ index, e.g. this way:
   - kill Fuseki
   - rm -r /tmp/lucene
   - start Fuseki again, so the index will be built

7. Make SPARQL queries from the web interface at http://localhost:3030

First try this SPARQL query:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT DISTINCT * WHERE {
  ?lit pf:textMatch "ar*" .
  ?conc skos:prefLabel ?lit .
  FILTER(REGEX(?lit, '^ar.*', 'i'))
} ORDER BY ?lit

I get 120 hits, including "Arab"@en.
Now try the same query, but change the pf:textMatch argument to "a*". This way I get only 32 results, not including "Arab"@en, even though the shorter prefix query should match a superset of what was matched by the first query (the regex should still filter it down to the same result set).
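The comparison described above could be scripted roughly as follows. This is a minimal sketch, not part of the original report: it assumes the Fuseki query service for the dataset is reachable at http://localhost:3030/ds/query and returns SPARQL JSON results, and the helper names (build_query, fetch_rows, rows_from_json, missing_from_broader) are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: query service URL for the /ds dataset used in the steps above.
ENDPOINT = "http://localhost:3030/ds/query"

# Same query as in the report; only the pf:textMatch argument varies.
QUERY_TEMPLATE = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT DISTINCT * WHERE {
  ?lit pf:textMatch "%s" .
  ?conc skos:prefLabel ?lit .
  FILTER(REGEX(?lit, '^ar.*', 'i'))
} ORDER BY ?lit
"""


def build_query(text_match):
    """Return the query text with the given pf:textMatch argument."""
    return QUERY_TEMPLATE % text_match


def rows_from_json(data):
    """Turn a SPARQL JSON result document into a set of
    (concept URI, label, language tag) tuples."""
    rows = set()
    for binding in data["results"]["bindings"]:
        lit = binding["lit"]
        rows.add((binding["conc"]["value"], lit["value"], lit.get("xml:lang", "")))
    return rows


def fetch_rows(text_match, endpoint=ENDPOINT):
    """Run the query over HTTP GET and return the result rows as a set."""
    params = urllib.parse.urlencode({"query": build_query(text_match)})
    request = urllib.request.Request(
        endpoint + "?" + params,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return rows_from_json(json.load(response))


def missing_from_broader(narrow_rows, broad_rows):
    """Rows matched by the longer prefix but absent for the shorter one.
    If no hits are lost, this list should be empty."""
    return sorted(narrow_rows - broad_rows)


# Usage against a running server (not executed here):
#   lost = missing_from_broader(fetch_rows("ar*"), fetch_rows("a*"))
#   for row in lost:
#       print("missing from a* results:", row)
```

With the numbers reported above (120 rows for "ar*", 32 for "a*"), the list returned by missing_from_broader would be non-empty and should include the "Arab"@en row.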
This issue is not just about single character prefix queries. With enough data sets loaded into the same index, this happens with longer prefix queries as well.
I think that the problem might be related to Lucene's default limitation of a maximum of 1024 clauses in boolean queries (and thus prefix query matches), as described in the Lucene FAQ: http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F


Yes, I think your hypothesis might be correct (I've not verified it yet).
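One rough way to check whether the hypothesis is even plausible for a given data set is to count how many distinct index terms a prefix would expand to and compare that with the 1024 limit. The sketch below is only an approximation under stated assumptions: it guesses that the index analyzer lowercases text and splits it on non-word characters (the real analyzer used by LARQ may behave differently), it must be fed every literal that ends up in the index rather than only the prefLabels, and the function names are hypothetical. It does not confirm that the clause limit is the actual cause.

```python
import re

# Default BooleanQuery clause limit mentioned in the Lucene FAQ cited above.
MAX_CLAUSES = 1024


def distinct_terms_with_prefix(labels, prefix):
    """Collect the distinct lowercase tokens that start with the prefix.

    Tokenization is an assumption: lowercase, then split on non-word
    characters. It only approximates what the index analyzer does.
    """
    prefix = prefix.lower()
    terms = set()
    for label in labels:
        for token in re.findall(r"\w+", label.lower()):
            if token.startswith(prefix):
                terms.add(token)
    return terms


def exceeds_clause_limit(labels, prefix, limit=MAX_CLAUSES):
    """True if the prefix would expand to more distinct terms than the limit."""
    return len(distinct_terms_with_prefix(labels, prefix)) > limit


# Tiny illustration with made-up labels:
labels = ["Arab", "Arab countries", "Agriculture", "Banking"]
print(sorted(distinct_terms_with_prefix(labels, "a")))
print(exceeds_clause_limit(labels, "a"))
```

If the count for "a" on the full set of indexed literals comes out well above 1024 while the count for "ar" stays below it, that would be consistent with the hypothesis; if both are below the limit, the cause is probably elsewhere.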

In case this is the problem, is there any way to tell LARQ to use a higher BooleanQuery.setMaxClauseCount() value so that this limit is not triggered? I find it a bit disturbing that hits are silently being lost. I couldn't see any special output on the Fuseki log.


Not sure about this.

Paolo

Am I doing something wrong? If this is a genuine problem in LARQ, I can of course make a bug report.
Thanks and best regards,
Osma Suominen
