Hi Osma

On 20/08/12 11:10, Osma Suominen wrote:
> Hi Paolo!
> 
> Thanks for your quick reply.
> 
> 17.08.2012 20:16, Paolo Castagna wrote:
>> Does your problem go away without changing the code and using:
>> ?lit pf:textMatch ( 'a*' 100000 )
> 
> I tested this but it didn't help. If I use a parameter less than 1000
> then I get even fewer hits, but values above 1000 don't have any effect.

Right.

> I think the problem is this line in IndexLARQ.java:
> 
> TopDocs topDocs = searcher.search(query, (Filter)null, LARQ.NUM_RESULTS ) ;
> 
> As you can see the parameter for maximum number of hits is taken
> directly from the NUM_RESULTS constant. The value specified in the query
> has no effect on this level.

Correct.
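A minimal local change would be to honour the per-query limit at that point instead of the constant, something like this (just a sketch, not the actual LARQ code; `queryLimit` here stands for the value parsed from the second pf:textMatch argument, it is not an existing variable in IndexLARQ):

```java
// Sketch only: prefer the per-query limit over the global default.
// queryLimit is a hypothetical name for the value given in the query,
// e.g. the 100000 in: ?lit pf:textMatch ( 'a*' 100000 )
int limit = (queryLimit > 0) ? queryLimit : LARQ.NUM_RESULTS ;
TopDocs topDocs = searcher.search(query, (Filter)null, limit) ;
```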

>> It's not a problem adding a couple of '0'...
>> However, I am thinking that this would just shift the problem, isn't it?
> 
> You're right, it would just shift the problem but a sufficiently large
> value could be used that never caused problems in practice. Maybe you
> could consider NUM_RESULTS = Integer.MAX_VALUE ? :)

Many search use cases are about driving a UI for people, and often only
the first few results are needed.

Try hitting 'next >>' on Google repeatedly: how many results can you
actually get?

;-)

Anyway, I have increased the NUM_RESULTS constant.
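For reference, the change amounts to bumping the default in LARQ.java; e.g. (the value shown below is the one you tested with, not necessarily what went into the snapshot):

```java
// org/apache/jena/larq/LARQ.java
// The number of results returned by default
public static final int NUM_RESULTS = 100000 ; // was 1000
```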

> Or maybe LARQ should use another variant of Lucene's
> IndexSearcher.search(), one which takes a Collector object instead of
> the integer n parameter. E.g. this:
> http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/core/org/apache/lucene/search/IndexSearcher.html#search%28org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20org.apache.lucene.search.Collector%29

Yes. That would be the thing to use if we want to retrieve all the
results from Lucene.

More thinking is necessary here...
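To sketch what that could look like against the Lucene 3.x API (an illustration only, not code that exists in LARQ): a Collector that accumulates every matching document id, which IndexSearcher can drive via search(query, collector):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Sketch: a Collector that keeps every matching doc id, so no fixed
// n parameter truncates the result set (Lucene 3.x Collector API).
public class AllDocsCollector extends Collector {
    private final List<Integer> docs = new ArrayList<Integer>();
    private int docBase = 0;

    @Override public void setScorer(Scorer scorer) {
        // scores are not needed, only membership
    }

    @Override public void collect(int doc) {
        // doc is relative to the current index segment
        docs.add(docBase + doc);
    }

    @Override public void setNextReader(IndexReader reader, int docBase) {
        this.docBase = docBase;
    }

    @Override public boolean acceptsDocsOutOfOrder() { return true; }

    public List<Integer> getDocs() { return docs; }
}
```

IndexLARQ would then iterate over collector.getDocs() instead of topDocs.scoreDocs, at the cost of unbounded memory for very broad queries, which is exactly the trade-off that needs more thinking.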

In the meantime, you can find a LARQ SNAPSHOT here:
https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-larq/1.0.1-SNAPSHOT/

Paolo

> 
> 
> Thanks,
> Osma
> 
> 
>> On 15/08/12 10:31, Osma Suominen wrote:
>>> Hi Paolo!
>>>
>>> Thanks for your reply and sorry for the delay.
>>>
>>> I tested this again with today's svn snapshot and it's still a problem.
>>>
>>> However, after digging a bit further I found this in
>>> jena-larq/src/main/java/org/apache/jena/larq/LARQ.java:
>>>
>>> --clip--
>>>      // The number of results returned by default
>>>      public static final int NUM_RESULTS             = 1000 ; // should
>>> we increase this? -- PC
>>> --clip--
>>>
>>> I changed NUM_RESULTS to 100000 (added two zeros), built and installed
>>> my modified LARQ with mvn install (NB this required tweaking arq.ver
>>> and tdb.ver in jena-larq/pom.xml to match the current svn versions),
>>> rebuilt Fuseki and now the problem is gone!
>>>
>>> I would suggest that this constant be increased to something larger
>>> than 1000. Based on the code comment, you seem to have had similar
>>> thoughts sometime in the past :)
>>>
>>> Thanks,
>>> Osma
>>>
>>>
>>> 15.07.2012 11:21, Paolo Castagna kirjoitti:
>>>> Hi Osma,
>>>> first of all, thanks for sharing your experience and clearly describing
>>>> your problem.
>>>> Further comments inline.
>>>>
>>>> On 13/07/12 14:13, Osma Suominen wrote:
>>>>> Hello!
>>>>>
>>>>> I'm trying to use a Fuseki SPARQL endpoint together with LARQ to
>>>>> create a system for accessing SKOS thesauri. The user interface
>>>>> includes an autocompletion widget. The idea is to use the LARQ index
>>>>> to make fast prefix queries on the concept labels.
>>>>>
>>>>> However, I've noticed that in some situations I get less results from
>>>>> the index than what I'd expect. This seems to happen when the LARQ
>>>>> part of the query internally produces many hits, such as when doing a
>>>>> single character prefix query (e.g. ?lit pf:textMatch 'a*').
>>>>>
>>>>> I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on 2012-07-10 and
>>>>> LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the LARQ
>>>>> dependency to pom.xml and running mvn package. Other than this issue,
>>>>> Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux
>>>>> 12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the standard
>>>>> Ubuntu packages.
>>>>>
>>>>>
>>>>> Steps to repeat:
>>>>>
>>>>> 1. package Fuseki with LARQ, as described above
>>>>>
>>>>> 2. start Fuseki with the attached configuration file, i.e.
>>>>>      ./fuseki-server --config=larq-config.ttl
>>>>>
>>>>> 3. I'm using the STW thesaurus as an easily accessible example data
>>>>> set (though the problem was originally found with other data sets):
>>>>>      - download http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
>>>>>      - unzip so you have stw.rdf
>>>>>
>>>>> 4. load the thesaurus file into the endpoint:
>>>>>      ./s-put http://localhost:3030/ds/data default stw.rdf
>>>>>
>>>>> 6. build the LARQ index, e.g. this way:
>>>>>      - kill Fuseki
>>>>>      - rm -r /tmp/lucene
>>>>>      - start Fuseki again, so the index will be built
>>>>>
>>>>> 7. Make SPARQL queries from the web interface at http://localhost:3030
>>>>>
>>>>> First try this SPARQL query:
>>>>>
>>>>> PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
>>>>> PREFIX pf:<http://jena.hpl.hp.com/ARQ/property#>
>>>>> SELECT DISTINCT * WHERE {
>>>>>     ?lit pf:textMatch "ar*" .
>>>>>     ?conc skos:prefLabel ?lit .
>>>>>     FILTER(REGEX(?lit, '^ar.*', 'i'))
>>>>> } ORDER BY ?lit
>>>>>
>>>>> I get 120 hits, including "Arab"@en.
>>>>>
>>>>> Now try the same query, but change the pf:textMatch argument to "a*".
>>>>> This way I get only 32 results, not including "Arab"@en, even though
>>>>> the shorter prefix query should match a superset of what was matched
>>>>> by the first query (the regex should still filter it down to the same
>>>>> result set).
>>>>>
>>>>>
>>>>> This issue is not just about single character prefix queries. With
>>>>> enough data sets loaded into the same index, this happens with longer
>>>>> prefix queries as well.
>>>>>
>>>>> I think that the problem might be related to Lucene's default
>>>>> limitation of a maximum of 1024 clauses in boolean queries (and thus
>>>>> prefix query matches), as described in the Lucene FAQ:
>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F
>>>>>
>>>>>
>>>>>
>>>>
>>>> Yes, I think your hypothesis might be correct (I've not verified it
>>>> yet).
>>>>
>>>>> In case this is the problem, is there any way to tell LARQ to use a
>>>>> higher BooleanQuery.setMaxClauseCount() value so that this limit is
>>>>> not triggered? I find it a bit disturbing that hits are silently being
>>>>> lost. I couldn't see any special output on the Fuseki log.
>>>>
>>>> Not sure about this.
>>>>
>>>> Paolo
>>>>
>>>>>
>>>>> Am I doing something wrong? If this is a genuine problem in LARQ, I
>>>>> can of course make a bug report.
>>>>>
>>>>>
>>>>> Thanks and best regards,
>>>>> Osma Suominen
>>>>>
>>>>
>>>>
>>>
>>>
>>
> 
> 
