Re: completion with Lucene: desirable from SPARQL

Jean-Marc Vanel Thu, 03 Nov 2016 06:12:41 -0700

2016-11-03 13:34 GMT+01:00 Osma Suominen <[email protected]>:


> Hi Jean-Marc,
>
> I'm not sure I understand why you need to put the weights inside the
> Lucene index. Is it done for performance reasons?
>

AFAIK, ordering results by weight is intimately linked to querying the text
index.
If I want the top 10 results, the search must already have the weights;
otherwise I must fetch all the results and filter them afterwards.
This is the reason for using AnalyzingInfixSuggester.
Lucene 4_9_1
https://lucene.apache.org/core/4_9_1/suggest/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.html
Lucene 6_2_1
https://lucene.apache.org/core/6_2_1/suggest/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.html

I guess this is what you call "performance reasons".
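
To make the top-10 argument concrete, here is a minimal Python sketch. It is
not Lucene code: `top_n` and the (label, uri, weight) tuples are hypothetical
stand-ins for a weighted suggest index. It only illustrates that, when each
entry carries its weight at search time, the search step itself can keep the
N best matches instead of handing back every match for later sorting.

```python
import heapq


def top_n(entries, fragment, n=10):
    """Return the n highest-weighted entries whose label contains `fragment`.

    `entries` is an iterable of (label, uri, weight) tuples; this layout is
    an assumption made for the sketch, not the Lucene data model.
    """
    needle = fragment.lower()
    # Infix match on the label, case-insensitive.
    matches = (e for e in entries if needle in e[0].lower())
    # nlargest keeps at most n candidates in memory while streaming the
    # matches, so the full result list is never materialised or sorted.
    return heapq.nlargest(n, matches, key=lambda e: e[2])
```

For example, `top_n(entries, "pari", n=10)` would return at most ten entries,
best weight first. The linear scan over `entries` is a simplification; a real
suggester uses an index for the matching step.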


> What if the data changes? I mean, not the indexed subject itself, but for
> example additional triples get added to the dataset using the same subject.
> Surely the Lucene index will get out of date?
>

As I wrote in the original post, "I'll have to implement also the callback
for updates like class TextDocProducerTriples in Jena-text."
http://jena.apache.org/documentation/javadoc/text/org/apache/jena/query/text/TextDocProducerTriples.html
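
As a rough illustration of what such an update callback has to maintain, here
is a minimal Python sketch of the triple-count weight kept up to date
incrementally. The class and method names (`TripleCountWeights`,
`triple_added`, `triple_removed`) are hypothetical and are not the
TextDocProducerTriples API; a real implementation would be driven by Jena's
change notifications and would also push the new weight into the Lucene index.

```python
from collections import Counter


class TripleCountWeights:
    """Weight of a URI = number of triples where it is subject or object."""

    def __init__(self):
        self.counts = Counter()

    def _nodes(self, s, o, object_is_uri):
        # Literal objects are not resources, so they get no weight.
        return (s, o) if object_is_uri else (s,)

    def triple_added(self, s, p, o, object_is_uri=True):
        for node in self._nodes(s, o, object_is_uri):
            self.counts[node] += 1

    def triple_removed(self, s, p, o, object_is_uri=True):
        for node in self._nodes(s, o, object_is_uri):
            self.counts[node] -= 1
            if self.counts[node] <= 0:
                # Drop exhausted entries so the table does not grow forever.
                del self.counts[node]

    def weight(self, uri):
        # Counter returns 0 for URIs it has never seen.
        return self.counts[uri]
```

With this shape, a batch (re-)indexing run and the update callback can share
the same counting logic: the batch run simply calls `triple_added` for every
triple in the dataset.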




> -Osma
>
>
> 03.11.2016, 13:51, Jean-Marc Vanel kirjoitti:
>
>> Hi Osma
>>
>> First I will implement the weight by counting the triples from and to each
>> URI being indexed in Lucene by Jena-text.
>> This will give users a first ordering in results, hopefully satisfying.
>> This is quite similar to the Google page rank, except that instead of
>> counting the <a href="XXX"> , it will count the triples.
>>
>> I sketched some code here with most of the plumbing:
>> https://github.com/jmvanel/semantic_forms/blob/master/scala/forms/src/main/scala/deductions/runtime/jena/lucene/TextIndexerWeight.scala
>>
>> Comments welcome. It's in Scala, but it should be understandable.
>> Note that I have one more library dependency :
>> libraryDependencies += "org.apache.lucene" % "lucene-suggest" % "4.9.1"
>>
>> This is code for batch primary indexing or re-indexing.
>> If this works well, I'll have to implement also the callback for updates
>> like class TextDocProducerTriples in Jena-text.
>>
>>
>>
>> 2016-11-01 13:59 GMT+01:00 Osma Suominen <[email protected]>:
>>
>> Hi Jean-Marc,
>>>
>>> The wildcard queries etc. are basic Lucene features, part of Lucene query
>>> syntax, so probably that's why they are not documented on the jena-text page.
>>> The query string is simply passed to the Lucene query parser by jena-text
>>> and should support any features of Lucene, see:
>>> http://lucene.apache.org/core/6_2_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package.description
>>>
>>> Glad you were able to get your lookup service working!
>>>
>>> Regarding the saving of weights: I think you could simply save them as
>>> triples (perhaps in a separate graph), outside the Lucene index. Then
>>> combine the results of the text:query with the weights from triples using
>>> SPARQL.
>>>
>>> The jena-text query also returns score values. I'm not sure how useful
>>> they are in your use case, but they could potentially be used as a factor
>>> in the overall "notoriety" calculation. Though if you are searching just
>>> for single words or prefixes, chances are that the score values will be
>>> the same for all results.
>>>
>>> Thanks for all the work on the Lucene 5 and 6 upgrade (JENA-1250)! I hope
>>> we can finish that work and get it merged soon after the 3.1.1 release.
>>> In any case the newer Lucene version should perform better and be easier to
>>> maintain moving forward.
>>>
>>> -Osma
>>>
>>> On 01/11/16 11:01, Jean-Marc Vanel wrote:
>>>
>>> It's too bad that the * wildcard feature, and other details of the SPARQL to
>>>> Lucene query translation, are not documented on the Jena text search page.
>>>>
>>>> Anyway, it works for my use case, I now have on my laptop a (kind of)
>>>> replacement of dbPedia lookup service.
>>>>
>>>> To experiment with the original dbPedia lookup service, you can go to
>>>> semantic_forms sandbox:
>>>> http://163.172.179.125:9111/create?uri=&uri=http%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2FPerson
>>>> and type a few letters in the dct:subject field.
>>>>
>>>> I don't need the full original literal value, because the URI results of
>>>> the query are labelled in the application: a foaf:Person is labelled by
>>>> given and family names, etc.
>>>>
>>>> BUT, there is a "but": the dbPedia lookup service results are appropriately
>>>> ordered by "notoriety".
>>>> Instead, with http://localhost:9000/lookup?q=*Pari* on my TDB that mirrors
>>>> dbPedia, I currently get:
>>>>
>>>> <ArrayOfResult>
>>>>          <Result>
>>>>            <Label>Université Pierre-et-Marie-Curie</Label>
>>>>            <URI>http://dbpedia.org/resource/Pierre_and_Marie_Curie_University</URI>
>>>>          </Result><Result>
>>>>            <Label>Guillaume Le Gentil</Label>
>>>>            <URI>http://dbpedia.org/resource/Guillaume_Le_Gentil</URI>
>>>>          </Result><Result>
>>>>            <Label>1 E1 m</Label>
>>>>            <URI>http://dbpedia.org/resource/1_decametre</URI>
>>>>          </Result><Result>
>>>>            <Label>1 E4 m</Label>
>>>>            <URI>http://dbpedia.org/resource/1_myriametre</URI>
>>>>          </Result><Result>
>>>>            <Label>Nadia Boulanger</Label>
>>>>            <URI>http://dbpedia.org/resource/Nadia_Boulanger</URI>
>>>>          </Result><Result>
>>>>            <Label>Luis Mariano</Label>
>>>>            <URI>http://dbpedia.org/resource/Luis_Mariano</URI>
>>>>          </Result><Result>
>>>>            <Label>Paul Chemetov</Label>
>>>>            <URI>http://dbpedia.org/resource/Paul_Chemetov</URI>
>>>>          </Result><Result>
>>>>            <Label>Marc Boegner</Label>
>>>>            <URI>http://dbpedia.org/resource/Marc_Boegner</URI>
>>>>          </Result><Result>
>>>>            <Label>Cassandre (graphiste)</Label>
>>>>            <URI>http://dbpedia.org/resource/Cassandre_(artist)</URI>
>>>>          </Result><Result>
>>>>            <Label>La Norville</Label>
>>>>            <URI>http://dbpedia.org/resource/La_Norville</URI>
>>>>          </Result>
>>>>      </ArrayOfResult>
>>>>
>>>> My understanding is that I need to set a weight on URIs in Lucene to
>>>> reflect their "notoriety".
>>>> I see 2 ways:
>>>>
>>>>     1. easy to implement: just count the triples from and to the URI
>>>>     2. also take into account the URIs consulted by users in my
>>>>     application (but currently I don't record that information); there is
>>>>     also the issue of combining weights 1) and 2)
>>>>
>>>> Google search does both weightings.
>>>>
>>>> So, in the short term I have to figure out how to add weights to the
>>>> Lucene - Jena index.
>>>>
>>>> Then I have to read what dbPedia lookup does, and other background
>>>> material.
>>>>
>>>>
>>>>
>>>> 2016-10-31 16:42 GMT+01:00 Osma Suominen <[email protected]>:
>>>>
>>>> Hi Jean-Marc,
>>>>
>>>>>
>>>>> Depending on what exactly you want from such a service, this may be
>>>>> already possible with jena-text.
>>>>>
>>>>> I'm assuming that you want to perform a prefix search such as "édu*" and
>>>>> get possible completions for that, such as "éducation".
>>>>>
>>>>> You can of course already do a prefix search with jena-text. What you will
>>>>> get back will be the RDF resources which have labels that contain this
>>>>> prefix. If the text index is configured to store literal values, you can
>>>>> ask for the actual values as well.
>>>>>
>>>>> E.g. with this data:
>>>>>
>>>>> ex:cse rdfs:label "Conseil supérieur de l'éducation"@fr .
>>>>>
>>>>> and a suitably configured jena-text index, you can perform this query:
>>>>>
>>>>> (?s ?score ?literal) text:query (rdfs:label "édu*") .
>>>>>
>>>>> and get back these bindings:
>>>>>
>>>>> ?s=ex:cse ?literal="Conseil supérieur de l'éducation"@fr
>>>>>
>>>>> However, you will get the full original literal value, not just the
>>>>> individual word that matched ("éducation"). If you want just the matched
>>>>> word, you will need special support that jena-text doesn't currently have.
>>>>>
>>>>> -Osma
>>>>>
>>>>> On 17/10/16 11:37, Jean-Marc Vanel wrote:
>>>>>
>>>>> Hi
>>>>>
>>>>>>
>>>>>> I'm implementing an equivalent of the dbPedia lookup service [1] in
>>>>>> semantic_forms, leveraging the Lucene integration in TDB, and a dbPedia
>>>>>> mirror with TDB [2].
>>>>>>
>>>>>> The dbPedia lookup service is really nice but:
>>>>>>
>>>>>>      - the hosted service is often down
>>>>>>      - completion is in English only
>>>>>>
>>>>>> A lookup service with TDB and Lucene would overcome these 2 problems.
>>>>>>
>>>>>> So I would need completion with Lucene from SPARQL.
>>>>>> According to Jena doc., this does not seem to be implemented:
>>>>>> https://jena.apache.org/documentation/query/text-query.html#query-with-sparql
>>>>>>
>>>>>> Searching for "lucene completion" turns up plenty of pages.
>>>>>>
>>>>>> Among these pages there is a code snippet here:
>>>>>> http://stackoverflow.com/questions/120180/how-to-do-query-auto-completion-suggestions-in-lucene
>>>>>> but a regular Lucene API may exist.
>>>>>>
>>>>>> [1] https://github.com/dbpedia/lookup
>>>>>> [2] https://github.com/jmvanel/semantic_forms/blob/master/doc/en/administration.md#populating-with-dbpedia-mirroring-dbpedia
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>> Osma Suominen
>>>>> D.Sc. (Tech), Information Systems Specialist
>>>>> National Library of Finland
>>>>> P.O. Box 26 (Kaikukatu 4)
>>>>> 00014 HELSINGIN YLIOPISTO
>>>>> Tel. +358 50 3199529
>>>>> [email protected]
>>>>> http://www.nationallibrary.fi
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>> --
>>> Osma Suominen
>>> D.Sc. (Tech), Information Systems Specialist
>>> National Library of Finland
>>> P.O. Box 26 (Kaikukatu 4)
>>> 00014 HELSINGIN YLIOPISTO
>>> Tel. +358 50 3199529
>>> [email protected]
>>> http://www.nationallibrary.fi
>>>
>>>
>>
>>
>>
>
> --
> Osma Suominen
> D.Sc. (Tech), Information Systems Specialist
> National Library of Finland
> P.O. Box 26 (Kaikukatu 4)
> 00014 HELSINGIN YLIOPISTO
> Tel. +358 50 3199529
> [email protected]
> http://www.nationallibrary.fi
>



-- 
Jean-Marc Vanel
Profil:
http://163.172.179.125:9111/display?displayuri=http%3A%2F%2Fjmvanel.free.fr%2Fjmv.rdf%23me
Déductions SARL - Consulting, services, training,
Rule-based programming, Semantic Web
+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
