LARQ prefix search results missing hits

Osma Suominen Fri, 13 Jul 2012 06:14:28 -0700

Hello!

I'm trying to use a Fuseki SPARQL endpoint together with LARQ to create a system for accessing SKOS thesauri. The user interface includes an autocompletion widget. The idea is to use the LARQ index to make fast prefix queries on the concept labels.

However, I've noticed that in some situations I get fewer results from the index than I'd expect. This seems to happen when the LARQ part of the query internally produces many hits, such as when doing a single-character prefix query (e.g. ?lit pf:textMatch 'a*').

I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on 2012-07-10 and LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the LARQ dependency to pom.xml and running mvn package. Other than this issue, Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux 12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the standard Ubuntu packages.



Steps to repeat:

1. package Fuseki with LARQ, as described above

2. start Fuseki with the attached configuration file, i.e.
   ./fuseki-server --config=larq-config.ttl

3. I'm using the STW thesaurus as an easily accessible example data set (though the problem was originally found with other data sets):

   - download http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
   - unzip so you have stw.rdf

4. load the thesaurus file into the endpoint:
   ./s-put http://localhost:3030/ds/data default stw.rdf

6. build the LARQ index, e.g. this way:
   - kill Fuseki
   - rm -r /tmp/lucene
   - start Fuseki again, so the index will be built

7. Make SPARQL queries from the web interface at http://localhost:3030

First try this SPARQL query:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT DISTINCT * WHERE {
  ?lit pf:textMatch "ar*" .
  ?conc skos:prefLabel ?lit .
  FILTER(REGEX(?lit, '^ar.*', 'i'))
} ORDER BY ?lit

I get 120 hits, including "Arab"@en.
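
To repeat the query for several prefixes without the web interface, something like the following Python sketch could be used. It is only an illustration and not part of my setup: the endpoint URL is assumed from the attached configuration (dataset "ds", query service "sparql"), and the names build_query and fetch_labels are hypothetical helpers.

```python
import json
import urllib.parse
import urllib.request

# Assumption: derived from the attached config (fuseki:name "ds", serviceQuery "sparql").
ENDPOINT = "http://localhost:3030/ds/sparql"

# Same query as above, with the two prefixes left as placeholders.
QUERY_TEMPLATE = """\
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT DISTINCT * WHERE {{
  ?lit pf:textMatch "{text_match}*" .
  ?conc skos:prefLabel ?lit .
  FILTER(REGEX(?lit, '^{regex_prefix}.*', 'i'))
}} ORDER BY ?lit
"""


def build_query(text_match, regex_prefix):
    """Return the SPARQL query text for the given LARQ prefix and regex prefix."""
    # Hypothetical safety check: only plain alphanumeric prefixes, so no escaping is needed.
    if not (text_match.isalnum() and regex_prefix.isalnum()):
        raise ValueError("only plain alphanumeric prefixes are supported")
    return QUERY_TEMPLATE.format(text_match=text_match, regex_prefix=regex_prefix)


def fetch_labels(text_match, regex_prefix, endpoint=ENDPOINT):
    """POST the query to the endpoint and return the ?lit values as a list."""
    body = urllib.parse.urlencode(
        {"query": build_query(text_match, regex_prefix)}
    ).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=body,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        results = json.load(response)
    return [row["lit"]["value"] for row in results["results"]["bindings"]]
```

With Fuseki running, fetch_labels("ar", "ar") would correspond to the query above.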

Now try the same query, but change the pf:textMatch argument to "a*". This way I get only 32 results, not including "Arab"@en, even though the shorter prefix query should match a superset of what was matched by the first query (the regex should still filter it down to the same result set).
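
The superset expectation can be checked mechanically. The following Python sketch is an illustration only: it assumes the labels returned by each query have already been collected into lists, and the function names and sample labels are hypothetical.

```python
import re


def expected_matches(labels, prefix):
    """Labels kept by a case-insensitive '^<prefix>' regex, like the FILTER above."""
    pattern = re.compile("^" + re.escape(prefix), re.IGNORECASE)
    return {label for label in labels if pattern.search(label)}


def missing_hits(narrow_results, broad_results, prefix):
    """Labels found via the longer textMatch prefix but lost with the shorter one.

    If the shorter prefix query really returned a superset, this list is empty.
    """
    narrow = expected_matches(narrow_results, prefix)
    broad = expected_matches(broad_results, prefix)
    return sorted(narrow - broad)


# Hypothetical sample data, not actual query output.
from_ar_query = ["Arab", "Art"]
from_a_query = ["Art", "Apple"]
print(missing_hits(from_ar_query, from_a_query, "ar"))
```

A non-empty result means that hits were dropped somewhere before the regex filter was applied.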

This issue is not just about single-character prefix queries. With enough data sets loaded into the same index, this happens with longer prefix queries as well.

I think that the problem might be related to Lucene's default limitation of a maximum of 1024 clauses in boolean queries (and thus prefix query matches), as described in the Lucene FAQ:

http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F

In case this is the problem, is there any way to tell LARQ to use a higher BooleanQuery.setMaxClauseCount() value so that this limit is not triggered? I find it a bit disturbing that hits are silently being lost. I couldn't see any special output in the Fuseki log.

Am I doing something wrong? If this is a genuine problem in LARQ, I can of course make a bug report.



Thanks and best regards,
Osma Suominen

--
Osma Suominen | [email protected] | +358 40 5255 882

Aalto University, Department of Media Technology, Semantic Computing Research Group
Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076 Aalto, Finland

@prefix :        <#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .

[] rdf:type fuseki:Server ;
   fuseki:services (
     <#service1>
   ) .

<#service1> rdf:type fuseki:Service ;
    fuseki:name                        "ds" ;
    fuseki:serviceQuery                "sparql" ;
    fuseki:serviceQuery                "query" ;
    fuseki:serviceUpdate               "update" ;
    fuseki:serviceUpload               "upload" ;
    fuseki:serviceReadWriteGraphStore  "data" ;
    fuseki:serviceReadGraphStore       "get" ;
    fuseki:serviceReadGraphStore       "" ;
    fuseki:dataset                     <#dataset1> ;
    .

[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

<#dataset1> rdf:type      tdb:DatasetTDB ;
    tdb:location "/tmp/tdb" ;
    ja:textIndex "/tmp/lucene"
    .
