Re: LARQ prefix search results missing hits

Paolo Castagna Tue, 11 Sep 2012 00:57:28 -0700

Apologies, this was a mistake.

Paolo


On 10 September 2012 23:07, Paolo Castagna <castagna.li...@gmail.com> wrote:
> Hi Osma
>
> On 28/08/12 14:22, Osma Suominen wrote:
>> Hi Paolo!
>>
>> Thanks a lot for the fix! I have tested the latest snapshot and it now
>> works as expected. At least until I add lots of new data and hit the new
>> limit :)
>>
>>
>> You're of course right about the search use case. I think the problem
>> here is that the LARQ index can be used for two very different use cases:
>>
>> A. Traditional IR, in which the user cares about only the first few
>> results. Lucene is obviously very good at this, though full advantage
>> (especially for non-English languages) of it can only be achieved by
>> using specific Analyzer implementations, which appears not to be
>> supported in LARQ, at least not without writing some Java code.
>>
>> B. Speeding up queries on literals for e.g. autocomplete search. While
>> this can be done without a text index using FILTER(REGEX()), the queries
>> tend to be quite slow, as the filter is applied only afterwards. In this
>> case it is important that the text index returns all possible hits, not
>> just the first ones.
>>
>> I have no idea which is the more important use case for LARQ, but I'm
>> currently only interested in B because of the requirements of the
>> application I'm building (ONKI Light, a SKOS vocabulary browser for
>> SPARQL endpoints).
>
> Do you have any idea or proposal for making LARQ good for both of these
> use cases?
>
>> Currently the benefits of LARQ (at least for the out-of-the-box
>> configuration for Fuseki+LARQ) for both A and B are somewhat diminished
>> by these limitations:
>>
>> 1. The index is global and contains data from all named graphs mixed up.
>> This means that when you have many named graphs with different data (as
>> I do), and try to query only one graph, the LARQ query part will still
>> return hits from all the other graphs, slowing down later parts of the
>> query.
>
> Yep.
>
> I thought about this a while ago, but I haven't actually tried to implement
> it. The changes to the index are trivial. The most
> difficult part is perhaps on the property function side, but
> maybe that is easy as well.
>
> I think this could be a good contribution, if you need it.
>
>> 2. Similarly, the index does not allow filtering by language on the
>> query level. With multilingual data, you cannot make a query matching
>> e.g. only English labels but will get hits from all the other languages
>> as well.
>
> Yep.
>
> I have no proposal for this, but I understand the user need.
>
>> 3. The default implementation also doesn't store much context for the
>> literal, meaning that you cannot restrict the search only to e.g.
>> skos:prefLabel literal values in skos:Concept type resources. This will
>> again increase the number of hits returned by the index internally.
>
> I am not sure I follow this, or that I completely agree with you.
>
> What you say is true, but LARQ provides a property function and you
> can use it together with other triple patterns:
>
>  {
>    ?l pf:textMatch '...' .
>    ?s skos:prefLabel ?l .
>    ?s rdf:type skos:Concept .
>  }
>
> Now, we can argue on what a clever optimizer should/could do,
> but from a point of view of the user, this is quite good and
> powerful and it gets you what you want. Doesn't it?
>
> The syntax is very easy to remember and the property function
> very easy to use.
>
> The Lucene index can be kept quite simple and small.
>
>>
>> There may also be problems with prefix queries if you happen to hit the
>> default BooleanQuery limit of 1024 clauses, but I haven't yet had this
>> problem myself with LARQ. Another problem for use case B might be that
>> the default Lucene StandardAnalyzer, which LARQ seems to use, filters
>> common English stop words from the index and the query, which might
>> interfere with the exact matching required for B.
>>
>> To be fair, other SPARQL text index implementations are not that good
>> for prefix searches either. Virtuoso [1] requires at least 4 character
>> prefixes to be specified (this can be changed by recompiling). AFAICT
>> the 4store text index [2] doesn't support prefix queries at all, as the
>> index structure requires whole words to be used (though possibly some
>> creative use of subqueries and FILTER(REGEX()) could be used to still
>> get some benefit of the index).
>>
>> Osma
>>
>> [1]
>> http://docs.openlinksw.com/virtuoso/sparqlextensions.html#rdfsparqlrulefulltext
>>
>> [2] http://4store.org/trac/wiki/TextIndexing
>>
>> 26.08.2012 22:49, Paolo Castagna wrote:
>>> Hi Osma
>>>
>>> On 20/08/12 11:10, Osma Suominen wrote:
>>>> Hi Paolo!
>>>>
>>>> Thanks for your quick reply.
>>>>
>>>> 17.08.2012 20:16, Paolo Castagna wrote:
>>>>> Does your problem go away without changing the code and using:
>>>>> ?lit pf:textMatch ( 'a*' 100000 )
>>>>
>>>> I tested this but it didn't help. If I use a parameter less than 1000
>>>> then I get even fewer hits, but values above 1000 don't have any effect.
>>>
>>> Right.
>>>
>>>> I think the problem is this line in IndexLARQ.java:
>>>>
>>>> TopDocs topDocs = searcher.search(query, (Filter)null,
>>>> LARQ.NUM_RESULTS ) ;
>>>>
>>>> As you can see the parameter for maximum number of hits is taken
>>>> directly from the NUM_RESULTS constant. The value specified in the query
>>>> has no effect on this level.
>>>
>>> Correct.
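A minimal sketch of the effect discussed above, assuming a simplified model rather than the real LARQ or Lucene code: the index lookup is capped at a fixed number of hits before the SPARQL FILTER runs, so a broader prefix can crowd out the labels the filter would have kept. The class `CappedPrefixSearchDemo`, its methods, and the sample labels are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Simplified model only: this is NOT LARQ or Lucene code.
final class CappedPrefixSearchDemo {

    // Hypothetical sample labels, in the order the pretend index returns them.
    static final List<String> LABELS = Arrays.asList(
            "Economic analysis", "Trade agreement", "Labour administration",
            "Arab", "Arbitration", "Banking");

    // Returns at most maxHits labels containing a word that starts with prefix.
    static List<String> prefixSearch(String prefix, int maxHits) {
        List<String> hits = new ArrayList<>();
        for (String label : LABELS) {
            if (hits.size() >= maxHits) {
                break; // the cap silently drops every remaining match
            }
            for (String word : label.toLowerCase().split("\\s+")) {
                if (word.startsWith(prefix)) {
                    hits.add(label);
                    break;
                }
            }
        }
        return hits;
    }

    // Stand-in for FILTER(REGEX(?lit, regex, 'i')), applied after the lookup.
    static List<String> regexFilter(List<String> hits, String regex) {
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        List<String> kept = new ArrayList<>();
        for (String hit : hits) {
            if (pattern.matcher(hit).find()) {
                kept.add(hit);
            }
        }
        return kept;
    }

    // Number of labels starting with "ar" that survive a capped prefix lookup.
    static int filteredCount(String prefix, int maxHits) {
        return regexFilter(prefixSearch(prefix, maxHits), "^ar.*").size();
    }

    public static void main(String[] args) {
        System.out.println("ar*, cap 3:   " + filteredCount("ar", 3));
        System.out.println("a*,  cap 3:   " + filteredCount("a", 3));
        System.out.println("a*,  cap 100: " + filteredCount("a", 100));
    }
}
```

In this toy data set, the "ar" lookup with a cap of 3 keeps both "Arab" and "Arbitration", while the "a" lookup with the same cap keeps neither, because three other labels fill the cap first. Raising the cap to 100 brings both back, which mirrors why a larger NUM_RESULTS only moves the threshold.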
>>>
>>>>> It's not a problem adding a couple of '0'...
>>>>> However, I am thinking that this would just shift the problem,
>>>>> wouldn't it?
>>>>
>>>> You're right, it would just shift the problem but a sufficiently large
>>>> value could be used that never caused problems in practice. Maybe you
>>>> could consider NUM_RESULTS = Integer.MAX_VALUE ? :)
>>>
>>> A lot of search use cases are about driving a UI for people, and
>>> often only the first few results are necessary.
>>>
>>> Try to keep hitting 'next >>' on Google. How many results can you get?
>>>
>>> ;-)
>>>
>>> Anyway, I increased the NUM_RESULTS constant.
>>>
>>>> Or maybe LARQ should use another variant of Lucene's
>>>> IndexSearcher.search(), one which takes a Collector object instead of
>>>> the integer n parameter. E.g. this:
>>>> http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/core/org/apache/lucene/search/IndexSearcher.html#search%28org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20org.apache.lucene.search.Collector%29
>>>>
>>>
>>> Yes. That would be the thing to use if we want to retrieve all the
>>> results from Lucene.
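The collector idea can be modelled in plain Java as a rough sketch. This is not Lucene's actual Collector class: the `HitCollector` interface, the `CollectAllDemo` class, and the sample documents are hypothetical, and the only point is that a callback receives every matching document instead of the search stopping at a fixed n.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the "collect every hit" idea; NOT the Lucene API.
final class CollectAllDemo {

    // Hypothetical callback that is invoked once per matching document.
    interface HitCollector {
        void collect(int docId);
    }

    // Hypothetical sample documents; the array index plays the role of a doc id.
    static final String[] DOCS = {
        "Arab", "analysis", "Banking", "agreement", "Arbitration"
    };

    // Passes every document whose text starts with prefix to the collector.
    static void search(String prefix, HitCollector collector) {
        for (int docId = 0; docId < DOCS.length; docId++) {
            if (DOCS[docId].toLowerCase().startsWith(prefix)) {
                collector.collect(docId);
            }
        }
    }

    // Gathers all matching doc ids; there is no upper bound on the result size.
    static List<Integer> searchAll(String prefix) {
        final List<Integer> ids = new ArrayList<>();
        search(prefix, docId -> ids.add(docId));
        return ids;
    }

    static int countAll(String prefix) {
        return searchAll(prefix).size();
    }

    public static void main(String[] args) {
        System.out.println("a*  -> " + searchAll("a"));
        System.out.println("ar* -> " + searchAll("ar"));
    }
}
```

The trade-off in this model is memory: the list grows with the number of matches, which is presumably part of why an unbounded result set needs more thought before becoming a default.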
>>>
>>> More thinking is necessary here...
>>>
>>> In the meantime, you can find a LARQ SNAPSHOT here:
>>> https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-larq/1.0.1-SNAPSHOT/
>>>
>>>
>>> Paolo
>>>
>>>>
>>>>
>>>> Thanks,
>>>> Osma
>>>>
>>>>
>>>>> On 15/08/12 10:31, Osma Suominen wrote:
>>>>>> Hi Paolo!
>>>>>>
>>>>>> Thanks for your reply and sorry for the delay.
>>>>>>
>>>>>> I tested this again with today's svn snapshot and it's still a
>>>>>> problem.
>>>>>>
>>>>>> However, after digging a bit further I found this in
>>>>>> jena-larq/src/main/java/org/apache/jena/larq/LARQ.java:
>>>>>>
>>>>>> --clip--
>>>>>>       // The number of results returned by default
>>>>>>       public static final int NUM_RESULTS             = 1000 ; // should we increase this? -- PC
>>>>>> --clip--
>>>>>>
>>>>>> I changed NUM_RESULTS to 100000 (added two zeros), built and installed
>>>>>> my modified LARQ with mvn install (NB this required tweaking arq.ver
>>>>>> and tdb.ver in jena-larq/pom.xml to match the current svn versions),
>>>>>> rebuilt Fuseki and now the problem is gone!
>>>>>>
>>>>>> I would suggest that this constant be increased to something larger
>>>>>> than 1000. Based on the code comment, you seem to have had similar
>>>>>> thoughts sometime in the past :)
>>>>>>
>>>>>> Thanks,
>>>>>> Osma
>>>>>>
>>>>>>
>>>>>> 15.07.2012 11:21, Paolo Castagna wrote:
>>>>>>> Hi Osma,
>>>>>>> first of all, thanks for sharing your experience and clearly describing
>>>>>>> your problem.
>>>>>>> Further comments inline.
>>>>>>>
>>>>>>> On 13/07/12 14:13, Osma Suominen wrote:
>>>>>>>> Hello!
>>>>>>>>
>>>>>>>> I'm trying to use a Fuseki SPARQL endpoint together with LARQ to
>>>>>>>> create a system for accessing SKOS thesauri. The user interface
>>>>>>>> includes an autocompletion widget. The idea is to use the LARQ index
>>>>>>>> to make fast prefix queries on the concept labels.
>>>>>>>>
>>>>>>>> However, I've noticed that in some situations I get fewer results from
>>>>>>>> the index than I'd expect. This seems to happen when the LARQ
>>>>>>>> part of the query internally produces many hits, such as when doing a
>>>>>>>> single-character prefix query (e.g. ?lit pf:textMatch 'a*').
>>>>>>>>
>>>>>>>> I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on 2012-07-10 and
>>>>>>>> LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the LARQ
>>>>>>>> dependency to pom.xml and running mvn package. Other than this issue,
>>>>>>>> Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux
>>>>>>>> 12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the standard
>>>>>>>> Ubuntu packages.
>>>>>>>>
>>>>>>>>
>>>>>>>> Steps to repeat:
>>>>>>>>
>>>>>>>> 1. package Fuseki with LARQ, as described above
>>>>>>>>
>>>>>>>> 2. start Fuseki with the attached configuration file, i.e.
>>>>>>>>       ./fuseki-server --config=larq-config.ttl
>>>>>>>>
>>>>>>>> 3. I'm using the STW thesaurus as an easily accessible example data
>>>>>>>> set (though the problem was originally found with other data sets):
>>>>>>>>       - download
>>>>>>>> http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
>>>>>>>>       - unzip so you have stw.rdf
>>>>>>>>
>>>>>>>> 4. load the thesaurus file into the endpoint:
>>>>>>>>       ./s-put http://localhost:3030/ds/data default stw.rdf
>>>>>>>>
>>>>>>>> 5. build the LARQ index, e.g. this way:
>>>>>>>>       - kill Fuseki
>>>>>>>>       - rm -r /tmp/lucene
>>>>>>>>       - start Fuseki again, so the index will be built
>>>>>>>>
>>>>>>>> 6. Make SPARQL queries from the web interface at
>>>>>>>> http://localhost:3030
>>>>>>>>
>>>>>>>> First try this SPARQL query:
>>>>>>>>
>>>>>>>> PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
>>>>>>>> PREFIX pf:<http://jena.hpl.hp.com/ARQ/property#>
>>>>>>>> SELECT DISTINCT * WHERE {
>>>>>>>>      ?lit pf:textMatch "ar*" .
>>>>>>>>      ?conc skos:prefLabel ?lit .
>>>>>>>>      FILTER(REGEX(?lit, '^ar.*', 'i'))
>>>>>>>> } ORDER BY ?lit
>>>>>>>>
>>>>>>>> I get 120 hits, including "Arab"@en.
>>>>>>>>
>>>>>>>> Now try the same query, but change the pf:textMatch argument to "a*".
>>>>>>>> This way I get only 32 results, not including "Arab"@en, even though
>>>>>>>> the shorter prefix query should match a superset of what was matched
>>>>>>>> by the first query (the regex should still filter it down to the same
>>>>>>>> result set).
>>>>>>>>
>>>>>>>>
>>>>>>>> This issue is not just about single-character prefix queries. With
>>>>>>>> enough data sets loaded into the same index, this happens with longer
>>>>>>>> prefix queries as well.
>>>>>>>>
>>>>>>>> I think that the problem might be related to Lucene's default
>>>>>>>> limitation of a maximum of 1024 clauses in boolean queries (and thus
>>>>>>>> prefix query matches), as described in the Lucene FAQ:
>>>>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> Yes, I think your hypothesis might be correct (I've not verified it
>>>>>>> yet).
>>>>>>>
>>>>>>>> In case this is the problem, is there any way to tell LARQ to use a
>>>>>>>> higher BooleanQuery.setMaxClauseCount() value so that this limit is
>>>>>>>> not triggered? I find it a bit disturbing that hits are silently being
>>>>>>>> lost. I couldn't see any special output in the Fuseki log.
>>>>>>>
>>>>>>> Not sure about this.
>>>>>>>
>>>>>>> Paolo
>>>>>>>
>>>>>>>>
>>>>>>>> Am I doing something wrong? If this is a genuine problem in LARQ, I
>>>>>>>> can of course make a bug report.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks and best regards,
>>>>>>>> Osma Suominen
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
