Re: LARQ prefix search results missing hits

Paolo Castagna Mon, 10 Sep 2012 15:07:33 -0700

Hi Osma

On 28/08/12 14:22, Osma Suominen wrote:
> Hi Paolo!
>
> Thanks a lot for the fix! I have tested the latest snapshot and it now
> works as expected. At least until I add lots of new data and hit the new
> limit :)
>
>
> You're of course right about the search use case. I think the problem
> here is that the LARQ index can be used for two very different use cases:
>
> A. Traditional IR, in which the user cares about only the first few
> results. Lucene is obviously very good at this, though full advantage
> (especially for non-English languages) of it can only be achieved by
> using specific Analyzer implementations, which appears not to be
> supported in LARQ, at least not without writing some Java code.
>
> B. Speeding up queries on literals for e.g. autocomplete search. While
> this can be done without a text index using FILTER(REGEX()), the queries
> tend to be quite slow, as the filter is applied only afterwards. In this
> case it is important that the text index returns all possible hits, not
> just the first ones.
>
> I have no idea which is the more important use case for LARQ, but I'm
> currently only interested in B because of the requirements of the
> application I'm building (ONKI Light, a SKOS vocabulary browser for
> SPARQL endpoints).


Do you have any ideas or proposals for making LARQ work well for both
of these use cases?
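
A minimal sketch of the tension between the two use cases, in plain Java
with no Lucene or LARQ dependency. The class and method names
(PrefixSearchSketch, indexHits, labelsStartingWith) are hypothetical, the
sample labels other than "Arab" are invented, and hits come back in
insertion order rather than by Lucene score. The only point it tries to
show is that a cap applied inside the index, before the later FILTER
step, can drop hits the filter would have kept:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a text index; this is not LARQ or Lucene API.
class PrefixSearchSketch {

    // Returns labels containing a word that starts with the prefix.
    // maxHits <= 0 means "no cap" (use case B); a positive value keeps
    // only the first maxHits matches (use case A).
    static List<String> indexHits(List<String> labels, String prefix, int maxHits) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String label : labels) {
            if (maxHits > 0 && hits.size() >= maxHits) {
                break; // cap reached: remaining matches are silently dropped
            }
            for (String word : label.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (word.startsWith(p)) {
                    hits.add(label);
                    break;
                }
            }
        }
        return hits;
    }

    // Plays the role of FILTER(REGEX(?lit, '^prefix', 'i')) applied afterwards.
    static List<String> labelsStartingWith(List<String> hits, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> kept = new ArrayList<>();
        for (String hit : hits) {
            if (hit.toLowerCase(Locale.ROOT).startsWith(p)) {
                kept.add(hit);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList(
                "Bank account", "Labour agreement", "Trade association",
                "Arab", "Arable land");
        // Capped at 3 hits: the short prefix fills the cap before "Arab" is seen.
        System.out.println(labelsStartingWith(indexHits(labels, "a", 3), "ar"));
        // Same cap, longer prefix: fewer index hits, so nothing is cut off.
        System.out.println(labelsStartingWith(indexHits(labels, "ar", 3), "ar"));
        // No cap: the short prefix now yields the same final result.
        System.out.println(labelsStartingWith(indexHits(labels, "a", 0), "ar"));
    }
}
```

With the cap at 3, the "a" prefix leaves nothing after filtering while
"ar" keeps both labels, which mirrors the behaviour reported in this
thread. Treating a non-positive limit as "no cap" is one possible way to
serve use case B while keeping a small default for use case A.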

> Currently the benefits of LARQ (at least for the out-of-the-box
> configuration for Fuseki+LARQ) for both A and B are somewhat diminished
> by these limitations:
>
> 1. The index is global and contains data from all named graphs mixed up.
> This means that when you have many named graphs with different data (as
> I do), and try to query only one graph, the LARQ query part will still
> return hits from all the other graphs, slowing down later parts of the
> query.

Yep.

I thought about this a while ago, but I haven't actually tried to
implement it. The changes to the index are trivial. The most
difficult part is perhaps on the property function side, but
maybe that is easy as well.

I think this could be a good contribution, if you need it.
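
One possible shape for a graph-aware index, sketched in plain Java under
the assumption that each indexed literal simply carries the name of the
graph it came from. Nothing here is LARQ or Lucene API:
GraphAwareIndexSketch, Entry and prefixMatch are hypothetical names, the
graph URIs are placeholders, and how the graph would be passed through
pf:textMatch is left open:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical in-memory index; illustrates the idea only.
class GraphAwareIndexSketch {

    // One indexed literal together with the named graph it was loaded from.
    static final class Entry {
        final String graph;
        final String literal;

        Entry(String graph, String literal) {
            this.graph = graph;
            this.literal = literal;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void add(String graph, String literal) {
        entries.add(new Entry(graph, literal));
    }

    // Case-insensitive prefix match on the whole literal.
    // A null graph searches every graph (the current global behaviour);
    // otherwise only literals from that graph are returned.
    List<String> prefixMatch(String prefix, String graph) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Entry e : entries) {
            boolean graphOk = graph == null || graph.equals(e.graph);
            if (graphOk && e.literal.toLowerCase(Locale.ROOT).startsWith(p)) {
                hits.add(e.literal);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        GraphAwareIndexSketch index = new GraphAwareIndexSketch();
        index.add("http://example.org/graph/stw", "Arab");
        index.add("http://example.org/graph/other", "Arabic");
        System.out.println(index.prefixMatch("ar", null));
        System.out.println(index.prefixMatch("ar", "http://example.org/graph/stw"));
    }
}
```

Restricting the match inside the index, rather than joining against the
graph afterwards, is what would keep hits from unrelated graphs from
slowing down the rest of the query.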

> 2. Similarly, the index does not allow filtering by language on the
> query level. With multilingual data, you cannot make a query matching
> e.g. only English labels but will get hits from all the other languages
> as well.

Yep.

I have no proposal for this, but I understand the user need.

> 3. The default implementation also doesn't store much context for the
> literal, meaning that you cannot restrict the search only to e.g.
> skos:prefLabel literal values in skos:Concept type resources. This will
> again increase the number of hits returned by the index internally.

I am not sure I follow this, or that I completely agree with you.

What you say is true, but LARQ provides a property function and you
can use it together with other triple patterns:

 {
   ?l pf:textMatch '...' .
   ?s skos:prefLabel ?l .
   ?s rdf:type skos:Concept .
 }

Now, we can argue about what a clever optimizer should or could do,
but from the user's point of view this is quite good and
powerful, and it gets you what you want, doesn't it?

The syntax is very easy to remember and the property function
very easy to use.

The Lucene index can be kept quite simple and small.

>
> There may also be problems with prefix queries if you happen to hit the
> default BooleanQuery limit of 1024 clauses, but I haven't yet had this
> problem myself with LARQ. Another problem for use case B might be that
> the default Lucene StandardAnalyzer, which LARQ seems to use, filters
> common English stop words from the index and the query, which might
> interfere with the exact matching required for B.
>
> To be fair, other SPARQL text index implementations are not that good
> for prefix searches either. Virtuoso [1] requires at least 4 character
> prefixes to be specified (this can be changed by recompiling). AFAICT
> the 4store text index [2] doesn't support prefix queries at all, as the
> index structure requires whole words to be used (though possibly some
> creative use of subqueries and FILTER(REGEX()) could be used to still
> get some benefit of the index).
>
> Osma
>
> [1]
> http://docs.openlinksw.com/virtuoso/sparqlextensions.html#rdfsparqlrulefulltext
>
> [2] http://4store.org/trac/wiki/TextIndexing
>
> 26.08.2012 22:49, Paolo Castagna wrote:
>> Hi Osma
>>
>> On 20/08/12 11:10, Osma Suominen wrote:
>>> Hi Paolo!
>>>
>>> Thanks for your quick reply.
>>>
>>> 17.08.2012 20:16, Paolo Castagna wrote:
>>>> Does your problem go away without changing the code and using:
>>>> ?lit pf:textMatch ( 'a*' 100000 )
>>>
>>> I tested this but it didn't help. If I use a parameter less than 1000
>>> then I get even fewer hits, but values above 1000 don't have any effect.
>>
>> Right.
>>
>>> I think the problem is this line in IndexLARQ.java:
>>>
>>> TopDocs topDocs = searcher.search(query, (Filter)null,
>>> LARQ.NUM_RESULTS ) ;
>>>
>>> As you can see the parameter for maximum number of hits is taken
>>> directly from the NUM_RESULTS constant. The value specified in the query
>>> has no effect on this level.
>>
>> Correct.
>>
>>>> It's not a problem adding a couple of '0'...
>>>> However, I am thinking that this would just shift the problem,
>>>> wouldn't it?
>>>
>>> You're right, it would just shift the problem but a sufficiently large
>>> value could be used that never caused problems in practice. Maybe you
>>> could consider NUM_RESULTS = Integer.MAX_VALUE ? :)
>>
>> A lot of search use cases are there to drive a UI for people, and
>> often only the first few results are necessary.
>>
>> Try to keep hitting 'next >>' on Google: how many results can you get?
>>
>> ;-)
>>
>> Anyway, I increased the NUM_RESULTS constant.
>>
>>> Or maybe LARQ should use another variant of Lucene's
>>> IndexSearcher.search(), one which takes a Collector object instead of
>>> the integer n parameter. E.g. this:
>>> http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/core/org/apache/lucene/search/IndexSearcher.html#search%28org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20org.apache.lucene.search.Collector%29
>>>
>>
>> Yes. That would be the thing to use if we want to retrieve all the
>> results from Lucene.
>>
>> More thinking is necessary here...
>>
>> In the meantime, you can find a LARQ SNAPSHOT here:
>> https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-larq/1.0.1-SNAPSHOT/
>>
>>
>> Paolo
>>
>>>
>>>
>>> Thanks,
>>> Osma
>>>
>>>
>>>> On 15/08/12 10:31, Osma Suominen wrote:
>>>>> Hi Paolo!
>>>>>
>>>>> Thanks for your reply and sorry for the delay.
>>>>>
>>>>> I tested this again with today's svn snapshot and it's still a
>>>>> problem.
>>>>>
>>>>> However, after digging a bit further I found this in
>>>>> jena-larq/src/main/java/org/apache/jena/larq/LARQ.java:
>>>>>
>>>>> --clip--
>>>>>       // The number of results returned by default
>>>>>       public static final int NUM_RESULTS = 1000 ; // should we increase this? -- PC
>>>>> --clip--
>>>>>
>>>>> I changed NUM_RESULTS to 100000 (added two zeros), built and installed
>>>>> my modified LARQ with mvn install (NB this required tweaking arq.ver
>>>>> and tdb.ver in jena-larq/pom.xml to match the current svn versions),
>>>>> rebuilt Fuseki and now the problem is gone!
>>>>>
>>>>> I would suggest that this constant be increased to something larger
>>>>> than 1000. Based on the code comment, you seem to have had similar
>>>>> thoughts sometime in the past :)
>>>>>
>>>>> Thanks,
>>>>> Osma
>>>>>
>>>>>
>>>>> 15.07.2012 11:21, Paolo Castagna wrote:
>>>>>> Hi Osma,
>>>>>> first of all, thanks for sharing your experience and clearly
>>>>>> describing your problem.
>>>>>> Further comments inline.
>>>>>>
>>>>>> On 13/07/12 14:13, Osma Suominen wrote:
>>>>>>> Hello!
>>>>>>>
>>>>>>> I'm trying to use a Fuseki SPARQL endpoint together with LARQ to
>>>>>>> create a system for accessing SKOS thesauri. The user interface
>>>>>>> includes an autocompletion widget. The idea is to use the LARQ index
>>>>>>> to make fast prefix queries on the concept labels.
>>>>>>>
>>>>>>> However, I've noticed that in some situations I get fewer results
>>>>>>> from the index than I'd expect. This seems to happen when the LARQ
>>>>>>> part of the query internally produces many hits, such as when
>>>>>>> doing a single character prefix query (e.g. ?lit pf:textMatch 'a*').
>>>>>>>
>>>>>>> I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on 2012-07-10
>>>>>>> and LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the
>>>>>>> LARQ dependency to pom.xml and running mvn package. Other than this
>>>>>>> issue, Fuseki and LARQ queries seem to work fine. I'm using Ubuntu
>>>>>>> Linux 12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the
>>>>>>> standard Ubuntu packages.
>>>>>>>
>>>>>>>
>>>>>>> Steps to repeat:
>>>>>>>
>>>>>>> 1. package Fuseki with LARQ, as described above
>>>>>>>
>>>>>>> 2. start Fuseki with the attached configuration file, i.e.
>>>>>>>       ./fuseki-server --config=larq-config.ttl
>>>>>>>
>>>>>>> 3. I'm using the STW thesaurus as an easily accessible example data
>>>>>>> set (though the problem was originally found with other data sets):
>>>>>>>       - download
>>>>>>> http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
>>>>>>>       - unzip so you have stw.rdf
>>>>>>>
>>>>>>> 4. load the thesaurus file into the endpoint:
>>>>>>>       ./s-put http://localhost:3030/ds/data default stw.rdf
>>>>>>>
>>>>>>> 5. build the LARQ index, e.g. this way:
>>>>>>>       - kill Fuseki
>>>>>>>       - rm -r /tmp/lucene
>>>>>>>       - start Fuseki again, so the index will be built
>>>>>>>
>>>>>>> 6. Make SPARQL queries from the web interface at
>>>>>>> http://localhost:3030
>>>>>>>
>>>>>>> First try this SPARQL query:
>>>>>>>
>>>>>>> PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
>>>>>>> PREFIX pf:<http://jena.hpl.hp.com/ARQ/property#>
>>>>>>> SELECT DISTINCT * WHERE {
>>>>>>>      ?lit pf:textMatch "ar*" .
>>>>>>>      ?conc skos:prefLabel ?lit .
>>>>>>>      FILTER(REGEX(?lit, '^ar.*', 'i'))
>>>>>>> } ORDER BY ?lit
>>>>>>>
>>>>>>> I get 120 hits, including "Arab"@en.
>>>>>>>
>>>>>>> Now try the same query, but change the pf:textMatch argument to
>>>>>>> "a*". This way I get only 32 results, not including "Arab"@en, even
>>>>>>> though the shorter prefix query should match a superset of what was
>>>>>>> matched by the first query (the regex should still filter it down to
>>>>>>> the same result set).
>>>>>>>
>>>>>>>
>>>>>>> This issue is not just about single character prefix queries. With
>>>>>>> enough data sets loaded into the same index, this happens with
>>>>>>> longer prefix queries as well.
>>>>>>>
>>>>>>> I think that the problem might be related to Lucene's default
>>>>>>> limitation of a maximum of 1024 clauses in boolean queries (and thus
>>>>>>> prefix query matches), as described in the Lucene FAQ:
>>>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Yes, I think your hypothesis might be correct (I've not verified it
>>>>>> yet).
>>>>>>
>>>>>>> In case this is the problem, is there any way to tell LARQ to use a
>>>>>>> higher BooleanQuery.setMaxClauseCount() value so that this limit is
>>>>>>> not triggered? I find it a bit disturbing that hits are silently
>>>>>>> being lost. I couldn't see any special output in the Fuseki log.
>>>>>>
>>>>>> Not sure about this.
>>>>>>
>>>>>> Paolo
>>>>>>
>>>>>>>
>>>>>>> Am I doing something wrong? If this is a genuine problem in LARQ, I
>>>>>>> can of course make a bug report.
>>>>>>>
>>>>>>>
>>>>>>> Thanks and best regards,
>>>>>>> Osma Suominen
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>
