Hi Osma,
thanks for your help and feedback.

Does your problem go away, without changing the code, if you use:
?lit pf:textMatch ( 'a*' 100000 )

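If I remember the list form correctly, the second element is the maximum
number of hits returned from the index, so the full query would look
something like:

PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
PREFIX pf:<http://jena.hpl.hp.com/ARQ/property#>
SELECT DISTINCT * WHERE {
   ?lit pf:textMatch ( 'a*' 100000 ) .
   ?conc skos:prefLabel ?lit .
   FILTER(REGEX(?lit, '^ar.*', 'i'))
} ORDER BY ?lit
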
It's not a problem to add a couple of '0's...
However, I suspect this would just shift the problem, wouldn't it?

Paolo

On 15/08/12 10:31, Osma Suominen wrote:
> Hi Paolo!
>
> Thanks for your reply and sorry for the delay.
>
> I tested this again with today's svn snapshot and it's still a problem.
>
> However, after digging a bit further I found this in
> jena-larq/src/main/java/org/apache/jena/larq/LARQ.java:
>
> --clip--
>     // The number of results returned by default
>     public static final int NUM_RESULTS             = 1000 ; // should we increase this? -- PC
> --clip--
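>
> (If I read the code right, this constant presumably ends up as the hit
> cap of the Lucene search call, i.e. something along these lines -- a
> guess, I haven't traced the exact call site:
>
> --clip--
>     // hypothetical call site: anything beyond NUM_RESULTS is
>     // silently dropped before LARQ binds the matches
>     TopDocs hits = indexSearcher.search(query, LARQ.NUM_RESULTS) ;
> --clip--
>
> which would explain why hits go missing without any error.)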
>
> I changed NUM_RESULTS to 100000 (added two zeros), built and installed
> my modified LARQ with mvn install (NB this required tweaking arq.ver
> and tdb.ver in jena-larq/pom.xml to match the current svn versions),
> rebuilt Fuseki and now the problem is gone!
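>
> That is, the whole fix was this one-line change in LARQ.java:
>
> --clip--
>     public static final int NUM_RESULTS             = 100000 ; // was 1000
> --clip--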
>
> I would suggest that this constant be increased to something larger
> than 1000. Based on the code comment, you seem to have had similar
> thoughts sometime in the past :)
>
> Thanks,
> Osma
>
>
> 15.07.2012 11:21, Paolo Castagna wrote:
>> Hi Osma,
>> first of all, thanks for sharing your experience and clearly describing
>> your problem.
>> Further comments inline.
>>
>> On 13/07/12 14:13, Osma Suominen wrote:
>>> Hello!
>>>
>>> I'm trying to use a Fuseki SPARQL endpoint together with LARQ to
>>> create a system for accessing SKOS thesauri. The user interface
>>> includes an autocompletion widget. The idea is to use the LARQ index
>>> to make fast prefix queries on the concept labels.
>>>
>>> However, I've noticed that in some situations I get fewer results
>>> from the index than I'd expect. This seems to happen when the LARQ
>>> part of the query internally produces many hits, such as when doing
>>> a single-character prefix query (e.g. ?lit pf:textMatch 'a*').
>>>
>>> I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on 2012-07-10 and
>>> LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the LARQ
>>> dependency to pom.xml and running mvn package. Other than this issue,
>>> Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux
>>> 12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the standard
>>> Ubuntu packages.
>>>
>>>
>>> Steps to repeat:
>>>
>>> 1. package Fuseki with LARQ, as described above
>>>
>>> 2. start Fuseki with the attached configuration file, i.e.
>>>     ./fuseki-server --config=larq-config.ttl
>>>
>>> 3. I'm using the STW thesaurus as an easily accessible example data
>>> set (though the problem was originally found with other data sets):
>>>     - download http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
>>>     - unzip so you have stw.rdf
>>>
>>> 4. load the thesaurus file into the endpoint:
>>>     ./s-put http://localhost:3030/ds/data default stw.rdf
>>>
>>> 5. build the LARQ index, e.g. this way:
>>>     - kill Fuseki
>>>     - rm -r /tmp/lucene
>>>     - start Fuseki again, so the index will be built
>>>
>>> 6. make SPARQL queries from the web interface at http://localhost:3030
>>>
>>> First try this SPARQL query:
>>>
>>> PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
>>> PREFIX pf:<http://jena.hpl.hp.com/ARQ/property#>
>>> SELECT DISTINCT * WHERE {
>>>    ?lit pf:textMatch "ar*" .
>>>    ?conc skos:prefLabel ?lit .
>>>    FILTER(REGEX(?lit, '^ar.*', 'i'))
>>> } ORDER BY ?lit
>>>
>>> I get 120 hits, including "Arab"@en.
>>>
>>> Now try the same query, but change the pf:textMatch argument to "a*".
>>> This way I get only 32 results, not including "Arab"@en, even though
>>> the shorter prefix query should match a superset of what was matched
>>> by the first query (the regex should still filter it down to the same
>>> result set).
>>>
>>>
>>> This issue is not just about single-character prefix queries. With
>>> enough data sets loaded into the same index, this happens with longer
>>> prefix queries as well.
>>>
>>> I think that the problem might be related to Lucene's default
>>> limitation of a maximum of 1024 clauses in boolean queries (and thus
>>> prefix query matches), as described in the Lucene FAQ:
>>> http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F
>>>
>>>
>>
>> Yes, I think your hypothesis might be correct (I've not verified it
>> yet).
>>
>>> In case this is the problem, is there any way to tell LARQ to use a
>>> higher BooleanQuery.setMaxClauseCount() value so that this limit is
>>> not triggered? I find it a bit disturbing that hits are silently being
>>> lost. I couldn't see any special output in the Fuseki log.
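>>>
>>> (To illustrate what I mean -- assuming Lucene 3.x, which LARQ
>>> 1.0.0-incubating seems to use -- the static setting would be:
>>>
>>> --clip--
>>>     // Lucene's default is 1024 clauses; per the FAQ above, a prefix
>>>     // query that expands to more terms than this hits the limit
>>>     org.apache.lucene.search.BooleanQuery.setMaxClauseCount(1024 * 32);
>>> --clip--
>>>
>>> but I couldn't find a place where LARQ lets me call this.)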
>>
>> Not sure about this.
>>
>> Paolo
>>
>>>
>>> Am I doing something wrong? If this is a genuine problem in LARQ, I
>>> can of course file a bug report.
>>>
>>>
>>> Thanks and best regards,
>>> Osma Suominen
>>>
>>
>>
>
>
