2016-11-03 13:34 GMT+01:00 Osma Suominen <[email protected]>:
> Hi Jean-Marc,
>
> I'm not sure I understand why you need to put the weights inside the
> Lucene index. Is it done for performance reasons?
>

AFAIK, using the weights to order results is intimately linked to querying the text index. If I want the top 10 results, the search must have the weights beforehand; otherwise I must fetch all the results and filter them afterwards. This is the reason for using AnalyzingInfixSuggester.

Lucene 4_9_1
https://lucene.apache.org/core/4_9_1/suggest/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.html
Lucene 6_2_1
https://lucene.apache.org/core/6_2_1/suggest/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.html

I guess this is what you call "performance reasons".

> What if the data changes? I mean, not the indexed subject itself, but for
> example additional triples get added to the dataset using the same subject.
> Surely the Lucene index will get out of date?
>

As I wrote in the original post, "I'll have to implement also the callback for updates like class TextDocProducerTriples in Jena-text."
http://jena.apache.org/documentation/javadoc/text/org/apache/jena/query/text/TextDocProducerTriples.html

> -Osma
>
>
> 03.11.2016, 13:51, Jean-Marc Vanel wrote:
>
>> Hi Osma
>>
>> First I will implement the weight by counting the triples from and to each
>> URI being indexed in Lucene by Jena-text.
>> This will give users a first ordering in results, hopefully satisfying.
>> This is quite similar to the Google page rank, except that instead of
>> counting the <a href="XXX"> links, it will count the triples.
>>
>> I sketched some code here with most of the plumbing:
>> https://github.com/jmvanel/semantic_forms/blob/master/scala/forms/src/main/scala/deductions/runtime/jena/lucene/TextIndexerWeight.scala
>>
>> Comments welcome. It's in Scala, but it should be understandable.
>> Note that I have one more library dependency:
>> libraryDependencies += "org.apache.lucene" % "lucene-suggest" % "4.9.1"
>>
>> This is code for batch primary indexing or re-indexing.
>> If this works well, I'll have to implement also the callback for updates
>> like class TextDocProducerTriples in Jena-text.
>>
>>
>> 2016-11-01 13:59 GMT+01:00 Osma Suominen <[email protected]>:
>>
>>> Hi Jean-Marc,
>>>
>>> The wildcard queries etc. are basic Lucene features, part of Lucene query
>>> syntax, so that's probably why they are not documented on the jena-text
>>> page. The query string is simply passed to the Lucene query parser by
>>> jena-text and should support any features of Lucene, see:
>>> http://lucene.apache.org/core/6_2_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package.description
>>>
>>> Glad you were able to get your lookup service working!
>>>
>>> Regarding the saving of weights: I think you could simply save them as
>>> triples (perhaps in a separate graph), outside the Lucene index. Then
>>> combine the results of the text:query with the weights from triples using
>>> SPARQL.
>>>
>>> The jena-text query also returns score values. I'm not sure how useful
>>> they are in your use case, but they could potentially be used as a factor
>>> in the overall "notoriety" calculation. Though if you are searching just
>>> for single words or prefixes, chances are that the score values will be
>>> the same for all results.
>>>
>>> Thanks for all the work on the Lucene 5 and 6 upgrade (JENA-1250)! I hope
>>> we can finish that work and get it merged soon after the 3.1.1 release.
>>> In any case the newer Lucene version should perform better and be easier
>>> to maintain moving forward.
>>>
>>> -Osma
>>>
>>> On 01/11/16 11:01, Jean-Marc Vanel wrote:
>>>
>>>> It's too bad that the * wildcard feature, and other details of the SPARQL
>>>> to Lucene query translation, are not documented on the Jena text search
>>>> page.
>>>>
>>>> Anyway, it works for my use case: I now have on my laptop a (kind of)
>>>> replacement for the dbPedia lookup service.
>>>>
>>>> To experiment with the original dbPedia lookup service, you can go to
>>>> the semantic_forms sandbox:
>>>> http://163.172.179.125:9111/create?uri=&uri=http%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2FPerson
>>>> and type a few letters in the dct:subject field.
>>>>
>>>> I don't need the full original literal value, because the URI results of
>>>> the query are labelled in the application: a foaf:Person is labelled by
>>>> given and family names, etc.
>>>>
>>>> BUT, there is a "but": the dbPedia lookup service results are
>>>> appropriately ordered by "notoriety".
>>>> Instead, I currently get this with http://localhost:9000/lookup?q=*Pari*
>>>> on my TDB that mirrors dbPedia:
>>>>
>>>> <ArrayOfResult>
>>>>   <Result>
>>>>     <Label>Université Pierre-et-Marie-Curie</Label>
>>>>     <URI>http://dbpedia.org/resource/Pierre_and_Marie_Curie_University</URI>
>>>>   </Result><Result>
>>>>     <Label>Guillaume Le Gentil</Label>
>>>>     <URI>http://dbpedia.org/resource/Guillaume_Le_Gentil</URI>
>>>>   </Result><Result>
>>>>     <Label>1 E1 m</Label>
>>>>     <URI>http://dbpedia.org/resource/1_decametre</URI>
>>>>   </Result><Result>
>>>>     <Label>1 E4 m</Label>
>>>>     <URI>http://dbpedia.org/resource/1_myriametre</URI>
>>>>   </Result><Result>
>>>>     <Label>Nadia Boulanger</Label>
>>>>     <URI>http://dbpedia.org/resource/Nadia_Boulanger</URI>
>>>>   </Result><Result>
>>>>     <Label>Luis Mariano</Label>
>>>>     <URI>http://dbpedia.org/resource/Luis_Mariano</URI>
>>>>   </Result><Result>
>>>>     <Label>Paul Chemetov</Label>
>>>>     <URI>http://dbpedia.org/resource/Paul_Chemetov</URI>
>>>>   </Result><Result>
>>>>     <Label>Marc Boegner</Label>
>>>>     <URI>http://dbpedia.org/resource/Marc_Boegner</URI>
>>>>   </Result><Result>
>>>>     <Label>Cassandre (graphiste)</Label>
>>>>     <URI>http://dbpedia.org/resource/Cassandre_(artist)</URI>
>>>>   </Result><Result>
>>>>     <Label>La Norville</Label>
>>>>     <URI>http://dbpedia.org/resource/La_Norville</URI>
>>>>   </Result>
>>>> </ArrayOfResult>
>>>>
>>>> My understanding is that I need to set a weight on URIs in Lucene to
>>>> reflect their "notoriety".
>>>> I see 2 ways:
>>>>
>>>>    1. easy to implement: just count the triples from and to the URI
>>>>    2. also take into account the URIs consulted by users in my
>>>>       application (but currently I don't record that information);
>>>>       there is also the issue of combining weights 1) and 2)
>>>>
>>>> Google search does both weightings.
>>>>
>>>> So, in the short term I have to figure out how to add weights to the
>>>> Lucene - Jena index.
>>>>
>>>> Then I have to read what dbPedia lookup does, and other background
>>>> material.
>>>>
>>>>
>>>> 2016-10-31 16:42 GMT+01:00 Osma Suominen <[email protected]>:
>>>>
>>>>> Hi Jean-Marc,
>>>>>
>>>>> Depending on what exactly you want from such a service, this may be
>>>>> already possible with jena-text.
>>>>>
>>>>> I'm assuming that you want to perform a prefix search such as "édu*"
>>>>> and get possible completions for that, such as "éducation".
>>>>>
>>>>> You can of course already do a prefix search with jena-text. What you
>>>>> will get back will be the RDF resources which have labels that contain
>>>>> this prefix. If the text index is configured to store literal values,
>>>>> you can ask for the actual values as well.
>>>>>
>>>>> E.g. with this data:
>>>>>
>>>>> ex:cse rdfs:label "Conseil supérieur de l'éducation"@fr .
>>>>>
>>>>> and a suitably configured jena-text index, you can perform this query:
>>>>>
>>>>> (?s ?score ?literal) text:query (rdfs:label "édu*") .
>>>>>
>>>>> and get back these bindings:
>>>>>
>>>>> ?s=ex:cse ?literal="Conseil supérieur de l'éducation"@fr
>>>>>
>>>>> However, you will get the full original literal value, not just the
>>>>> individual word that matched ("éducation").
>>>>> If you want just the matched word, you will need special support that
>>>>> jena-text doesn't currently have.
>>>>>
>>>>> -Osma
>>>>>
>>>>> On 17/10/16 11:37, Jean-Marc Vanel wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> I'm implementing an equivalent of the dbPedia lookup service [1] in
>>>>>> semantic_forms, leveraging the Lucene integration in TDB and a dbPedia
>>>>>> mirror in TDB [2].
>>>>>>
>>>>>> The dbPedia lookup service is really nice but:
>>>>>>
>>>>>>    - the hosted service is often down
>>>>>>    - completion is in English only
>>>>>>
>>>>>> A lookup service with TDB and Lucene would overcome these 2 problems.
>>>>>>
>>>>>> So I would need completion with Lucene from SPARQL.
>>>>>> According to the Jena documentation, this does not seem to be
>>>>>> implemented:
>>>>>> https://jena.apache.org/documentation/query/text-query.html#query-with-sparql
>>>>>>
>>>>>> There are plenty of pages when searching for
>>>>>> lucene completion
>>>>>>
>>>>>> From these pages there is a code snippet here
>>>>>> http://stackoverflow.com/questions/120180/how-to-do-query-auto-completion-suggestions-in-lucene
>>>>>> but a regular Lucene API may exist.
>>>>>>
>>>>>> [1] https://github.com/dbpedia/lookup
>>>>>> [2] https://github.com/jmvanel/semantic_forms/blob/master/doc/en/administration.md#populating-with-dbpedia-mirroring-dbpedia
>>>>>>
>>>>>
>>>>> --
>>>>> Osma Suominen
>>>>> D.Sc. (Tech), Information Systems Specialist
>>>>> National Library of Finland
>>>>> P.O. Box 26 (Kaikukatu 4)
>>>>> 00014 HELSINGIN YLIOPISTO
>>>>> Tel. +358 50 3199529
>>>>> [email protected]
>>>>> http://www.nationallibrary.fi
>>>>>
>>>>
>>>
>>> --
>>> Osma Suominen
>>> D.Sc. (Tech), Information Systems Specialist
>>> National Library of Finland
>>> P.O. Box 26 (Kaikukatu 4)
>>> 00014 HELSINGIN YLIOPISTO
>>> Tel. +358 50 3199529
>>> [email protected]
>>> http://www.nationallibrary.fi
>>>
>>
>
> --
> Osma Suominen
> D.Sc. (Tech), Information Systems Specialist
> National Library of Finland
> P.O. Box 26 (Kaikukatu 4)
> 00014 HELSINGIN YLIOPISTO
> Tel. +358 50 3199529
> [email protected]
> http://www.nationallibrary.fi
>

--
Jean-Marc Vanel
Profile: http://163.172.179.125:9111/display?displayuri=http%3A%2F%2Fjmvanel.free.fr%2Fjmv.rdf%23me
Déductions SARL - Consulting, services, training,
Rule-based programming, Semantic Web
+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
