Hi Osma,
first of all, thanks for sharing your experience and clearly describing your problem.
Further comments inline.

On 13/07/12 14:13, Osma Suominen wrote:
Hello!

I'm trying to use a Fuseki SPARQL endpoint together with LARQ to create a system for accessing SKOS thesauri. The user interface includes an autocompletion widget. The idea is to use the LARQ index to make fast prefix queries on the concept labels.

However, I've noticed that in some situations I get fewer results from the index than I'd expect. This seems to happen when the LARQ part of the query internally produces many hits, such as when doing a single character prefix query (e.g. ?lit pf:textMatch 'a*').

I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on 2012-07-10 and LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the LARQ dependency to pom.xml and running mvn package. Other than this issue, Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux 12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the standard Ubuntu packages.


Steps to repeat:

1. package Fuseki with LARQ, as described above

2. start Fuseki with the attached configuration file, i.e.
   ./fuseki-server --config=larq-config.ttl

3. I'm using the STW thesaurus as an easily accessible example data set (though the problem was originally found with other data sets):
   - download http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
   - unzip so you have stw.rdf

4. load the thesaurus file into the endpoint:
   ./s-put http://localhost:3030/ds/data default stw.rdf

5. build the LARQ index, e.g. this way:
   - kill Fuseki
   - rm -r /tmp/lucene
   - start Fuseki again, so the index will be built

6. Make SPARQL queries from the web interface at http://localhost:3030

First try this SPARQL query:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT DISTINCT * WHERE {
  ?lit pf:textMatch "ar*" .
  ?conc skos:prefLabel ?lit .
  FILTER(REGEX(?lit, '^ar.*', 'i'))
} ORDER BY ?lit

I get 120 hits, including "Arab"@en.

Now try the same query, but change the pf:textMatch argument to "a*". This way I get only 32 results, not including "Arab"@en, even though the shorter prefix query should match a superset of what was matched by the first query (the regex should still filter it down to the same result set).
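
The comparison above can also be scripted, which makes it easier to see exactly which labels go missing. The following is only a minimal sketch in Python (standard library only), not part of the original report. It assumes that the query service of the "ds" dataset is reachable at http://localhost:3030/ds/query and that it returns SPARQL JSON results; the helper names build_query, fetch_labels, lost_hits and report are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Fuseki serves SPARQL queries for dataset "ds" at this URL.
ENDPOINT = "http://localhost:3030/ds/query"

# Same query as in the report; only the two prefixes are parameters.
QUERY_TEMPLATE = """\
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT DISTINCT * WHERE {{
  ?lit pf:textMatch "{match}*" .
  ?conc skos:prefLabel ?lit .
  FILTER(REGEX(?lit, '^{regex}.*', 'i'))
}} ORDER BY ?lit
"""


def build_query(match_prefix, regex_prefix):
    """Build the query text. Prefixes are inserted verbatim (no escaping),
    so this is only suitable for plain alphabetic prefixes."""
    return QUERY_TEMPLATE.format(match=match_prefix, regex=regex_prefix)


def fetch_labels(query, endpoint=ENDPOINT):
    """Run the query and return the set of (label, language) pairs bound to ?lit."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request) as response:
        results = json.load(response)
    return {
        (binding["lit"]["value"], binding["lit"].get("xml:lang", ""))
        for binding in results["results"]["bindings"]
    }


def lost_hits(narrow, broad):
    """Labels found with the longer prefix but missing with the shorter one."""
    return sorted(narrow - broad)


def report(long_prefix, short_prefix):
    """Compare both index prefixes while keeping the regex filter fixed."""
    narrow = fetch_labels(build_query(long_prefix, long_prefix))
    broad = fetch_labels(build_query(short_prefix, long_prefix))
    print(len(narrow), "hits for", long_prefix, "/", len(broad), "hits for", short_prefix)
    for label, lang in lost_hits(narrow, broad):
        print("missing:", label, lang)
```

With Fuseki running as described in the steps above, calling report("ar", "a") would print both hit counts and list every label that the "ar*" query finds but the "a*" query does not. If no hits were being lost, that list would be empty.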


This issue is not just about single character prefix queries. With enough data sets loaded into the same index, this happens with longer prefix queries as well.

I think that the problem might be related to Lucene's default limitation of a maximum of 1024 clauses in boolean queries (and thus prefix query matches), as described in the Lucene FAQ: http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F

Yes, I think your hypothesis might be correct (I've not verified it yet).

In case this is the problem, is there any way to tell LARQ to use a higher BooleanQuery.setMaxClauseCount() value so that this limit is not triggered? I find it a bit disturbing that hits are silently being lost. I couldn't see any special output in the Fuseki log.

Not sure about this.

Paolo


Am I doing something wrong? If this is a genuine problem in LARQ, I can of course make a bug report.


Thanks and best regards,
Osma Suominen


