Apologies, this was a mistake. Paolo
On 10 September 2012 23:07, Paolo Castagna <castagna.li...@gmail.com> wrote:
> Hi Osma
>
> On 28/08/12 14:22, Osma Suominen wrote:
>> Hi Paolo!
>>
>> Thanks a lot for the fix! I have tested the latest snapshot and it now
>> works as expected. At least until I add lots of new data and hit the new
>> limit :)
>>
>> You're of course right about the search use case. I think the problem
>> here is that the LARQ index can be used for two very different use cases:
>>
>> A. Traditional IR, in which the user cares about only the first few
>> results. Lucene is obviously very good at this, though full advantage
>> (especially for non-English languages) of it can only be achieved by
>> using specific Analyzer implementations, which appears not to be
>> supported in LARQ, at least not without writing some Java code.
>>
>> B. Speeding up queries on literals for e.g. autocomplete search. While
>> this can be done without a text index using FILTER(REGEX()), the queries
>> tend to be quite slow, as the filter is applied only afterwards. In this
>> case it is important that the text index returns all possible hits, not
>> just the first ones.
>>
>> I have no idea which is the more important use case for LARQ, but I'm
>> currently only interested in B because of the requirements of the
>> application I'm building (ONKI Light, a SKOS vocabulary browser for
>> SPARQL endpoints).
>
> Do you have any idea/proposal to make LARQ be good for both these
> use cases?
>
>> Currently the benefits of LARQ (at least for the out-of-the-box
>> configuration for Fuseki+LARQ) for both A and B are somewhat diminished
>> by these limitations:
>>
>> 1. The index is global and contains data from all named graphs mixed up.
>> This means that when you have many named graphs with different data (as
>> I do), and try to query only one graph, the LARQ query part will still
>> return hits from all the other graphs, slowing down later parts of the
>> query.
>
> Yep.
>
> I thought about this a while ago, but I haven't actually tried to implement
> it. The changes to the index are trivial. The most
> difficult part perhaps is on the property function side, but
> maybe that's easy as well.
>
> I think this could be a good contribution, if you need it.
>
>> 2. Similarly, the index does not allow filtering by language on the
>> query level. With multilingual data, you cannot make a query matching
>> e.g. only English labels but will get hits from all the other languages
>> as well.
>
> Yep.
>
> I have no proposal for this, but I understand the user need.
>
>> 3. The default implementation also doesn't store much context for the
>> literal, meaning that you cannot restrict the search only to e.g.
>> skos:prefLabel literal values in skos:Concept type resources. This will
>> again increase the number of hits returned by the index internally.
>
> I am not sure I follow this, or that I completely agree with you.
>
> What you say is true, but LARQ provides a property function and you
> can use it together with other triple patterns:
>
> {
>   ?l pf:textMatch '...' .
>   ?s skos:prefLabel ?l .
>   ?s rdf:type skos:Concept .
> }
>
> Now, we can argue about what a clever optimizer should/could do,
> but from the user's point of view, this is quite good and
> powerful and it gets you what you want. Doesn't it?
>
> The syntax is very easy to remember and the property function
> very easy to use.
>
> The Lucene index can be kept quite simple and small.
>
>>
>> There may also be problems with prefix queries if you happen to hit the
>> default BooleanQuery limit of 1024 clauses, but I haven't yet had this
>> problem myself with LARQ. Another problem for use case B might be that
>> the default Lucene StandardAnalyzer, which LARQ seems to use, filters
>> common English stop words from the index and the query, which might
>> interfere with the exact matching required for B.
>>
>> To be fair, other SPARQL text index implementations are not that good
>> for prefix searches either. Virtuoso [1] requires at least 4 character
>> prefixes to be specified (this can be changed by recompiling). AFAICT
>> the 4store text index [2] doesn't support prefix queries at all, as the
>> index structure requires whole words to be used (though possibly some
>> creative use of subqueries and FILTER(REGEX()) could be used to still
>> get some benefit of the index).
>>
>> Osma
>>
>> [1] http://docs.openlinksw.com/virtuoso/sparqlextensions.html#rdfsparqlrulefulltext
>>
>> [2] http://4store.org/trac/wiki/TextIndexing
>>
>> 26.08.2012 22:49, Paolo Castagna wrote:
>>> Hi Osma
>>>
>>> On 20/08/12 11:10, Osma Suominen wrote:
>>>> Hi Paolo!
>>>>
>>>> Thanks for your quick reply.
>>>>
>>>> 17.08.2012 20:16, Paolo Castagna wrote:
>>>>> Does your problem go away without changing the code and using:
>>>>> ?lit pf:textMatch ( 'a*' 100000 )
>>>>
>>>> I tested this but it didn't help. If I use a parameter less than 1000
>>>> then I get even fewer hits, but values above 1000 don't have any effect.
>>>
>>> Right.
>>>
>>>> I think the problem is this line in IndexLARQ.java:
>>>>
>>>> TopDocs topDocs = searcher.search(query, (Filter)null,
>>>> LARQ.NUM_RESULTS ) ;
>>>>
>>>> As you can see the parameter for maximum number of hits is taken
>>>> directly from the NUM_RESULTS constant. The value specified in the query
>>>> has no effect on this level.
>>>
>>> Correct.
>>>
>>>>> It's not a problem adding a couple of '0'...
>>>>> However, I am thinking that this would just shift the problem,
>>>>> wouldn't it?
>>>>
>>>> You're right, it would just shift the problem but a sufficiently large
>>>> value could be used that never caused problems in practice. Maybe you
>>>> could consider NUM_RESULTS = Integer.MAX_VALUE ? :)
>>>
>>> A lot of search use cases are about driving a UI for people, and
>>> often only the first few results are necessary.
>>>
>>> Try to keep hitting 'next >>' on Google: how many results can you get?
>>>
>>> ;-)
>>>
>>> Anyway, I increased the NUM_RESULTS constant.
>>>
>>>> Or maybe LARQ should use another variant of Lucene's
>>>> IndexSearcher.search(), one which takes a Collector object instead of
>>>> the integer n parameter. E.g. this:
>>>> http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/core/org/apache/lucene/search/IndexSearcher.html#search%28org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20org.apache.lucene.search.Collector%29
>>>
>>> Yes. That would be the thing to use if we want to retrieve all the
>>> results from Lucene.
>>>
>>> More thinking is necessary here...
>>>
>>> In the meantime, you can find a LARQ SNAPSHOT here:
>>> https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-larq/1.0.1-SNAPSHOT/
>>>
>>> Paolo
>>>
>>>>
>>>> Thanks,
>>>> Osma
>>>>
>>>>> On 15/08/12 10:31, Osma Suominen wrote:
>>>>>> Hi Paolo!
>>>>>>
>>>>>> Thanks for your reply and sorry for the delay.
>>>>>>
>>>>>> I tested this again with today's svn snapshot and it's still a
>>>>>> problem.
>>>>>>
>>>>>> However, after digging a bit further I found this in
>>>>>> jena-larq/src/main/java/org/apache/jena/larq/LARQ.java:
>>>>>>
>>>>>> --clip--
>>>>>> // The number of results returned by default
>>>>>> public static final int NUM_RESULTS = 1000 ; // should we increase this? -- PC
>>>>>> --clip--
>>>>>>
>>>>>> I changed NUM_RESULTS to 100000 (added two zeros), built and installed
>>>>>> my modified LARQ with mvn install (NB this required tweaking arq.ver
>>>>>> and tdb.ver in jena-larq/pom.xml to match the current svn versions),
>>>>>> rebuilt Fuseki and now the problem is gone!
>>>>>>
>>>>>> I would suggest that this constant be increased to something larger
>>>>>> than 1000.
>>>>>> Based on the code comment, you seem to have had similar
>>>>>> thoughts sometime in the past :)
>>>>>>
>>>>>> Thanks,
>>>>>> Osma
>>>>>>
>>>>>> 15.07.2012 11:21, Paolo Castagna wrote:
>>>>>>> Hi Osma,
>>>>>>> first of all, thanks for sharing your experience and clearly
>>>>>>> describing your problem.
>>>>>>> Further comments inline.
>>>>>>>
>>>>>>> On 13/07/12 14:13, Osma Suominen wrote:
>>>>>>>> Hello!
>>>>>>>>
>>>>>>>> I'm trying to use a Fuseki SPARQL endpoint together with LARQ to
>>>>>>>> create a system for accessing SKOS thesauri. The user interface
>>>>>>>> includes an autocompletion widget. The idea is to use the LARQ index
>>>>>>>> to make fast prefix queries on the concept labels.
>>>>>>>>
>>>>>>>> However, I've noticed that in some situations I get fewer results
>>>>>>>> from the index than I'd expect. This seems to happen when the LARQ
>>>>>>>> part of the query internally produces many hits, such as when
>>>>>>>> doing a single character prefix query (e.g. ?lit pf:textMatch 'a*').
>>>>>>>>
>>>>>>>> I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on
>>>>>>>> 2012-07-10 and LARQ 1.0.0-incubating. I compiled Fuseki with LARQ
>>>>>>>> by adding the LARQ dependency to pom.xml and running mvn package.
>>>>>>>> Other than this issue, Fuseki and LARQ queries seem to work fine.
>>>>>>>> I'm using Ubuntu Linux 12.04 LTS amd64 with OpenJDK 1.6.0_24
>>>>>>>> installed from the standard Ubuntu packages.
>>>>>>>>
>>>>>>>> Steps to repeat:
>>>>>>>>
>>>>>>>> 1. package Fuseki with LARQ, as described above
>>>>>>>>
>>>>>>>> 2. start Fuseki with the attached configuration file, i.e.
>>>>>>>> ./fuseki-server --config=larq-config.ttl
>>>>>>>>
>>>>>>>> 3. I'm using the STW thesaurus as an easily accessible example data
>>>>>>>> set (though the problem was originally found with other data sets):
>>>>>>>> - download http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
>>>>>>>> - unzip so you have stw.rdf
>>>>>>>>
>>>>>>>> 4. load the thesaurus file into the endpoint:
>>>>>>>> ./s-put http://localhost:3030/ds/data default stw.rdf
>>>>>>>>
>>>>>>>> 6. build the LARQ index, e.g. this way:
>>>>>>>> - kill Fuseki
>>>>>>>> - rm -r /tmp/lucene
>>>>>>>> - start Fuseki again, so the index will be built
>>>>>>>>
>>>>>>>> 7. Make SPARQL queries from the web interface at
>>>>>>>> http://localhost:3030
>>>>>>>>
>>>>>>>> First try this SPARQL query:
>>>>>>>>
>>>>>>>> PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
>>>>>>>> PREFIX pf:<http://jena.hpl.hp.com/ARQ/property#>
>>>>>>>> SELECT DISTINCT * WHERE {
>>>>>>>>   ?lit pf:textMatch "ar*" .
>>>>>>>>   ?conc skos:prefLabel ?lit .
>>>>>>>>   FILTER(REGEX(?lit, '^ar.*', 'i'))
>>>>>>>> } ORDER BY ?lit
>>>>>>>>
>>>>>>>> I get 120 hits, including "Arab"@en.
>>>>>>>>
>>>>>>>> Now try the same query, but change the pf:textMatch argument to
>>>>>>>> "a*". This way I get only 32 results, not including "Arab"@en,
>>>>>>>> even though the shorter prefix query should match a superset of
>>>>>>>> what was matched by the first query (the regex should still filter
>>>>>>>> it down to the same result set).
>>>>>>>>
>>>>>>>> This issue is not just about single character prefix queries. With
>>>>>>>> enough data sets loaded into the same index, this happens with
>>>>>>>> longer prefix queries as well.
>>>>>>>>
>>>>>>>> I think that the problem might be related to Lucene's default
>>>>>>>> limitation of a maximum of 1024 clauses in boolean queries (and thus
>>>>>>>> prefix query matches), as described in the Lucene FAQ:
>>>>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F
>>>>>>>>
>>>>>>>
>>>>>>> Yes, I think your hypothesis might be correct (I've not verified it
>>>>>>> yet).
>>>>>>>
>>>>>>>> In case this is the problem, is there any way to tell LARQ to use a
>>>>>>>> higher BooleanQuery.setMaxClauseCount() value so that this limit is
>>>>>>>> not triggered? I find it a bit disturbing that hits are silently
>>>>>>>> being lost. I couldn't see any special output on the Fuseki log.
>>>>>>>
>>>>>>> Not sure about this.
>>>>>>>
>>>>>>> Paolo
>>>>>>>
>>>>>>>>
>>>>>>>> Am I doing something wrong? If this is a genuine problem in LARQ, I
>>>>>>>> can of course make a bug report.
>>>>>>>>
>>>>>>>> Thanks and best regards,
>>>>>>>> Osma Suominen