Hello!
I'm trying to use a Fuseki SPARQL endpoint together with LARQ to create
a system for accessing SKOS thesauri. The user interface includes an
autocompletion widget. The idea is to use the LARQ index to make fast
prefix queries on the concept labels.
However, I've noticed that in some situations I get fewer results from
the index than I'd expect. This seems to happen when the LARQ part
of the query internally produces many hits, such as when doing a
single-character prefix query (e.g. ?lit pf:textMatch 'a*').
I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on 2012-07-10 and
LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the LARQ
dependency to pom.xml and running mvn package. Other than this issue,
Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux 12.04
LTS amd64 with OpenJDK 1.6.0_24 installed from the standard Ubuntu packages.
Steps to reproduce:
1. package Fuseki with LARQ, as described above
2. start Fuseki with the attached configuration file, i.e.
./fuseki-server --config=larq-config.ttl
3. I'm using the STW thesaurus as an easily accessible example data set
(though the problem was originally found with other data sets):
- download http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
- unzip so you have stw.rdf
4. load the thesaurus file into the endpoint:
./s-put http://localhost:3030/ds/data default stw.rdf
5. build the LARQ index, e.g. this way:
- kill Fuseki
- rm -r /tmp/lucene
- start Fuseki again, so the index will be built
6. make SPARQL queries from the web interface at http://localhost:3030
First try this SPARQL query:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT DISTINCT * WHERE {
  ?lit pf:textMatch "ar*" .
  ?conc skos:prefLabel ?lit .
  FILTER(REGEX(?lit, '^ar.*', 'i'))
} ORDER BY ?lit
I get 120 hits, including "Arab"@en.
Now try the same query, but change the pf:textMatch argument to "a*".
This way I get only 32 results, not including "Arab"@en, even though the
shorter prefix query should match a superset of what was matched by the
first query (the regex should still filter it down to the same result set).
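The comparison between the two queries can also be scripted instead of done by hand in the web interface. The following is only a rough sketch: it assumes the query service is reachable at http://localhost:3030/ds/query (service "ds" with the "query" endpoint from the attached configuration) and that it returns SPARQL JSON results. The helper names build_query, labels_from_json, fetch_labels and missing_rows are made up for this illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: service "ds" with the "query" service name from the config.
ENDPOINT = "http://localhost:3030/ds/query"

QUERY_TEMPLATE = """\
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT DISTINCT * WHERE {
  ?lit pf:textMatch "%(lucene)s*" .
  ?conc skos:prefLabel ?lit .
  FILTER(REGEX(?lit, '^%(regex)s.*', 'i'))
} ORDER BY ?lit
"""


def build_query(lucene_prefix, regex_prefix):
    """Return the test query for the given index prefix and regex prefix."""
    return QUERY_TEMPLATE % {"lucene": lucene_prefix, "regex": regex_prefix}


def labels_from_json(text):
    """Extract (concept URI, label, language) rows from SPARQL JSON results."""
    bindings = json.loads(text)["results"]["bindings"]
    return {
        (b["conc"]["value"], b["lit"]["value"], b["lit"].get("xml:lang", ""))
        for b in bindings
    }


def fetch_labels(lucene_prefix, regex_prefix, endpoint=ENDPOINT):
    """Run the query over HTTP GET and return the set of result rows."""
    params = urllib.parse.urlencode(
        {"query": build_query(lucene_prefix, regex_prefix)}
    )
    request = urllib.request.Request(
        endpoint + "?" + params,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return labels_from_json(response.read().decode("utf-8"))


def missing_rows(narrow, broad):
    """Rows found with the longer prefix but absent with the shorter one."""
    return sorted(narrow - broad)
```

With Fuseki running, missing_rows(fetch_labels("ar", "ar"), fetch_labels("a", "ar")) should return an empty list if no hits are being lost; any rows it does return are the ones that disappear with the shorter prefix.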
This issue is not just about single character prefix queries. With
enough data sets loaded into the same index, this happens with longer
prefix queries as well.
I think that the problem might be related to Lucene's default limitation
of a maximum of 1024 clauses in boolean queries (and thus prefix query
matches), as described in the Lucene FAQ:
http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F
In case this is the problem, is there any way to tell LARQ to use a
higher BooleanQuery.setMaxClauseCount() value so that this limit is not
triggered? I find it a bit disturbing that hits are silently being lost.
I couldn't see any special output in the Fuseki log.
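To make the suspected mechanism concrete, here is a toy model of what a cap on the internal hit set would do. It is not LARQ or Lucene code: the labels, the alphabetical hit order and the limit of 3 are all made up, and whether such a cap is actually applied (and where) is exactly the open question.

```python
import re


def prefix_hits(labels, prefix, limit=None):
    """Toy stand-in for the index lookup: labels starting with `prefix`,
    optionally truncated to the first `limit` hits (the assumed cap)."""
    hits = [
        label
        for label in sorted(labels, key=str.lower)
        if label.lower().startswith(prefix.lower())
    ]
    return hits if limit is None else hits[:limit]


def filtered(hits, regex):
    """Toy stand-in for the case-insensitive FILTER(REGEX(...)) step."""
    pattern = re.compile(regex, re.IGNORECASE)
    return [hit for hit in hits if pattern.search(hit)]


# Made-up labels, for illustration only.
LABELS = ["Abattoir", "Accounting", "Agriculture", "Arab", "Arbitration", "Banking"]

# Without a cap, both prefixes give the same rows after filtering.
uncapped_ar = filtered(prefix_hits(LABELS, "ar"), "^ar.*")
uncapped_a = filtered(prefix_hits(LABELS, "a"), "^ar.*")

# With a cap of 3, the "a" lookup is cut off before it reaches "Arab",
# so the filter has nothing left to keep.
capped_ar = filtered(prefix_hits(LABELS, "ar", limit=3), "^ar.*")
capped_a = filtered(prefix_hits(LABELS, "a", limit=3), "^ar.*")
```

In this model the shorter prefix only loses rows once the number of index hits exceeds the cap, which would match the observation that longer prefixes are affected too when more data is loaded.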
Am I doing something wrong? If this is a genuine problem in LARQ, I can
of course make a bug report.
Thanks and best regards,
Osma Suominen
--
Osma Suominen | [email protected] | +358 40 5255 882
Aalto University, Department of Media Technology, Semantic Computing
Research Group
Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076
Aalto, Finland
# larq-config.ttl (attached configuration file)

@prefix :       <#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:    <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:     <http://jena.hpl.hp.com/2005/11/Assembler#> .

[] rdf:type fuseki:Server ;
    fuseki:services (
        <#service1>
    ) .

<#service1> rdf:type fuseki:Service ;
    fuseki:name "ds" ;
    fuseki:serviceQuery "sparql" ;
    fuseki:serviceQuery "query" ;
    fuseki:serviceUpdate "update" ;
    fuseki:serviceUpload "upload" ;
    fuseki:serviceReadWriteGraphStore "data" ;
    fuseki:serviceReadGraphStore "get" ;
    fuseki:serviceReadGraphStore "" ;
    fuseki:dataset <#dataset1> ;
    .

[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB rdfs:subClassOf ja:RDFDataset .
tdb:GraphTDB rdfs:subClassOf ja:Model .

<#dataset1> rdf:type tdb:DatasetTDB ;
    tdb:location "/tmp/tdb" ;
    ja:textIndex "/tmp/lucene"
    .