Hi Osma,

thanks for your help and feedback. Does your problem go away without
changing the code, i.e. by using:

    ?lit pf:textMatch ( 'a*' 100000 )
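(For reference: LARQ's pf:textMatch accepts either a bare query string or
an argument list, and an integer in that list caps the number of Lucene
hits returned. The sketch below shows the same workaround applied from
Java rather than through Fuseki. It assumes jena-larq 1.0.0-incubating on
the classpath and stw.rdf in the working directory; the class name is
illustrative and the pattern follows the LARQ documentation, so treat it
as a starting point rather than tested code.)

    import org.apache.jena.larq.IndexBuilderString;
    import org.apache.jena.larq.IndexLARQ;
    import org.apache.jena.larq.LARQ;

    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.query.QueryFactory;
    import com.hp.hpl.jena.query.ResultSetFormatter;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.util.FileManager;

    public class TextMatchLimitExample {
        public static void main(String[] args) {
            // Load the thesaurus into an in-memory model.
            Model model = FileManager.get().loadModel("stw.rdf");

            // Build an in-memory LARQ index over the string literals.
            IndexBuilderString builder = new IndexBuilderString();
            builder.indexStatements(model.listStatements());
            builder.closeWriter();
            IndexLARQ index = builder.getIndex();
            LARQ.setDefaultIndex(index);

            // The integer in the argument list caps the number of Lucene
            // hits returned (the built-in default is NUM_RESULTS = 1000).
            String qs =
                "PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>\n" +
                "SELECT ?lit WHERE { ?lit pf:textMatch ( 'a*' 100000 ) }";

            QueryExecution qe =
                QueryExecutionFactory.create(QueryFactory.create(qs), model);
            try {
                ResultSetFormatter.out(qe.execSelect());
            } finally {
                qe.close();
            }
        }
    }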
It's not a problem to add a couple of '0's... However, I suspect this
would just shift the problem, wouldn't it?

Paolo

On 15/08/12 10:31, Osma Suominen wrote:
> Hi Paolo!
>
> Thanks for your reply and sorry for the delay.
>
> I tested this again with today's svn snapshot and it's still a problem.
>
> However, after digging a bit further I found this in
> jena-larq/src/main/java/org/apache/jena/larq/LARQ.java:
>
> --clip--
> // The number of results returned by default
> public static final int NUM_RESULTS = 1000 ; // should we increase this? -- PC
> --clip--
>
> I changed NUM_RESULTS to 100000 (added two zeros), built and installed
> my modified LARQ with mvn install (NB this required tweaking arq.ver
> and tdb.ver in jena-larq/pom.xml to match the current svn versions),
> rebuilt Fuseki, and now the problem is gone!
>
> I would suggest that this constant be increased to something larger
> than 1000. Based on the code comment, you seem to have had similar
> thoughts sometime in the past :)
>
> Thanks,
> Osma
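(An aside on why the loss is silent: Lucene's IndexSearcher.search(Query, n)
returns at most n ScoreDocs and reports the full match count only through
TopDocs.totalHits; no exception or warning is produced when matches are cut
off. The standalone sketch below illustrates the behaviour a fixed
NUM_RESULTS cap would produce. It is plain Lucene 3.x, not LARQ code; the
field name and document count are made up for the demonstration.)

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class SilentTruncationDemo {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir,
                    new StandardAnalyzer(Version.LUCENE_30), true,
                    IndexWriter.MaxFieldLength.UNLIMITED);
            // Index 2000 documents whose label starts with "a".
            for (int i = 0; i < 2000; i++) {
                Document doc = new Document();
                doc.add(new Field("label", "a" + i,
                        Field.Store.YES, Field.Index.NOT_ANALYZED));
                writer.addDocument(doc);
            }
            writer.close();

            IndexSearcher searcher = new IndexSearcher(dir);
            // Ask for at most 1000 hits, as a NUM_RESULTS = 1000 cap would.
            TopDocs top = searcher.search(
                    new PrefixQuery(new Term("label", "a")), 1000);
            System.out.println("matches:  " + top.totalHits);        // 2000
            System.out.println("returned: " + top.scoreDocs.length); // 1000
            searcher.close();
        }
    }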
> On 15.07.2012 11:21, Paolo Castagna wrote:
>> Hi Osma,
>> first of all, thanks for sharing your experience and clearly
>> describing your problem.
>> Further comments inline.
>>
>> On 13/07/12 14:13, Osma Suominen wrote:
>>> Hello!
>>>
>>> I'm trying to use a Fuseki SPARQL endpoint together with LARQ to
>>> create a system for accessing SKOS thesauri. The user interface
>>> includes an autocompletion widget. The idea is to use the LARQ index
>>> to make fast prefix queries on the concept labels.
>>>
>>> However, I've noticed that in some situations I get fewer results
>>> from the index than I'd expect. This seems to happen when the LARQ
>>> part of the query internally produces many hits, such as when doing
>>> a single-character prefix query (e.g. ?lit pf:textMatch 'a*').
>>>
>>> I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on 2012-07-10
>>> and LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the
>>> LARQ dependency to pom.xml and running mvn package. Other than this
>>> issue, Fuseki and LARQ queries seem to work fine. I'm using Ubuntu
>>> Linux 12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the
>>> standard Ubuntu packages.
>>>
>>> Steps to repeat:
>>>
>>> 1. Package Fuseki with LARQ, as described above.
>>>
>>> 2. Start Fuseki with the attached configuration file, i.e.
>>>    ./fuseki-server --config=larq-config.ttl
>>>
>>> 3. Get the STW thesaurus, an easily accessible example data set
>>>    (though the problem was originally found with other data sets):
>>>    - download http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
>>>    - unzip so you have stw.rdf
>>>
>>> 4. Load the thesaurus file into the endpoint:
>>>    ./s-put http://localhost:3030/ds/data default stw.rdf
>>>
>>> 5. Build the LARQ index, e.g. this way:
>>>    - kill Fuseki
>>>    - rm -r /tmp/lucene
>>>    - start Fuseki again, so the index will be rebuilt
>>>
>>> 6. Make SPARQL queries from the web interface at http://localhost:3030
>>>
>>> First try this SPARQL query:
>>>
>>> PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
>>> PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
>>> SELECT DISTINCT * WHERE {
>>>   ?lit pf:textMatch "ar*" .
>>>   ?conc skos:prefLabel ?lit .
>>>   FILTER(REGEX(?lit, '^ar.*', 'i'))
>>> } ORDER BY ?lit
>>>
>>> I get 120 hits, including "Arab"@en.
>>>
>>> Now try the same query, but change the pf:textMatch argument to "a*".
>>> This way I get only 32 results, not including "Arab"@en, even though
>>> the shorter prefix query should match a superset of what was matched
>>> by the first query (the regex should still filter it down to the same
>>> result set).
>>>
>>> This issue is not limited to single-character prefix queries. With
>>> enough data sets loaded into the same index, it happens with longer
>>> prefix queries as well.
>>>
>>> I think the problem might be related to Lucene's default limit of
>>> 1024 clauses in boolean queries (and thus in expanded prefix
>>> queries), as described in the Lucene FAQ:
>>> http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F
>>
>> Yes, I think your hypothesis might be correct (I've not verified it
>> yet).
>>
>>> In case this is the problem, is there any way to tell LARQ to use a
>>> higher BooleanQuery.setMaxClauseCount() value so that this limit is
>>> not triggered? I find it a bit disturbing that hits are silently
>>> being lost. I couldn't see any special output in the Fuseki log.
>>
>> Not sure about this.
>>
>> Paolo
>>
>>> Am I doing something wrong? If this is a genuine problem in LARQ, I
>>> can of course make a bug report.
>>>
>>> Thanks and best regards,
>>> Osma Suominen
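(A closing note on the setMaxClauseCount() question above: Lucene's clause
limit is a JVM-global static setting, so if the 1024-clause hypothesis had
been the cause, raising it once at startup, before any queries run, would
have been the knob to try, as in the untested sketch below. Note, however,
that exceeding the clause limit normally surfaces as a
BooleanQuery.TooManyClauses exception rather than as silently missing hits,
which is consistent with the NUM_RESULTS cap found above being the actual
culprit here.)

    import org.apache.lucene.search.BooleanQuery;

    public class RaiseClauseLimit {
        public static void main(String[] args) {
            // The default is 1024; this is a JVM-global static setting,
            // so call it once at startup before any queries are rewritten.
            BooleanQuery.setMaxClauseCount(100000);
            System.out.println(BooleanQuery.getMaxClauseCount());
        }
    }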
