Hi Jean-Marc,

> Guten Tag Lorenz !
Good job! German is a very difficult language.
>
> I don't know what is "IR" .
IR = Information Retrieval, which is what Lucene is basically made for.
>
> And reusing Lucene is the plan.
> The current code is here (as I mentioned earlier in this thread):
> https://github.com/jmvanel/semantic_forms/blob/master/scala/forms/src/main/scala/deductions/runtime/jena/lucene/TextIndexerWeight.scala
>
> I don't know how to combine TF-IDF with ranking based on links.
> I'm not even sure that, in an RDF world, term frequency is bringing much
> useful information.
> If you have some synthesis articles to recommend on search in RDF world, or
> in general, that would help.
There has been some discussion of how to combine ranking metrics like
PageRank with the standard Lucene score, e.g. [1], [2].
I think this can be done via boosting during indexing or by some
user-defined sort.

There has been a lot of research regarding entity ranking; among
others, you can have a look at [3].

[1] http://blog.trifork.com/2011/11/16/apache-lucene-flexiblescoring-with-indexdocvalues/
[2] http://stackoverflow.com/questions/22473498/solr-boost-score-based-on-wikipedia-pagerank-and-solr-score
[3] http://ceur-ws.org/Vol-1586/know2.pdf
>
>
> I put on the sandbox the ranking in research (counting the links à la Google
> rank), so my FOAF profile is now first, due to many cco:expertise links :
> http://163.172.179.125:9111/wordsearch?q=Jean-Marc
> In good Company with Jean Sablon, Jean Moulin, and pope JP 2.
>
> The TDB was populated with dbpedia with these scripts :
> https://github.com/jmvanel/semantic_forms/blob/master/scala/forms_play/scripts/download-dbpedia.sh
> https://github.com/jmvanel/semantic_forms/blob/master/scala/forms_play/scripts/populate_with_dbpedia.sh
>
>
> 2016-11-04 10:05 GMT+01:00 Lorenz B. <[email protected]>:
>
>> Hello Jean-Marc,
>>
>> I think adding something like a pagerank score would improve the
>> results.
>> Lucene itself just uses more or less the standard IR measure
>> TF/IDF.
>>
>>
>>
>> Cheers,
>> Lorenz
>>
>>> Osma,
>>>
>>> That makes sense,
>>> and the first tests are not bad.
>>>
>>> Although I'm surprised that "par*" does not get dbpedia:Paris in the
>>> first 10;
>>> but "pari*" does get dbpedia:Paris in the first position:
>>>
>>> "count" "s"
>>> "3090"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Paris
>>> "2676"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/London
>>> "72"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Émile_Durkheim
>>> "68"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Henri_Bergson
>>> "66"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/20th_arrondissement_of_Paris
>>> "64"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Cornelius_Castoriadis
>>> "64"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Jacques_Derrida
>>> "63"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Michel_Foucault
>>> "62"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Louis,_Grand_Condé
>>> "60"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Jean-Jacques_Rousseau
>>>
>>>
>>> I'll add that SPARQL in my sandbox as a replacement of dbpedia lookup
>>> service,
>>> and tell you how it goes.
>>> But I foresee that using the Lucene implementation after adding the
>>> weights will be more efficient. But that demands more work...
>>>
>>>
>>> 2016-11-03 14:30 GMT+01:00 Osma Suominen <[email protected]>:
>>>
>>>> Hi Jean-Marc!
>>>>
>>>>> AFAIK using the weights to order results is intimately linked to the
>>>>> text index querying.
>>>>> If I want the top 10 results, the search must have the weights
>>>>> beforehand
>>>>> otherwise I must get all the results to filter later.
>>>>> This is the reason for using AnalyzingInfixSuggester.
>>>>> Lucene 4_9_1
>>>>> https://lucene.apache.org/core/4_9_1/suggest/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.html
>>>>> Lucene 6_2_1
>>>>> https://lucene.apache.org/core/6_2_1/suggest/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.html
>>>>>
>>>>> I guess this is what you call "performance reasons" .
>>>>>
>>>> I don't see why you couldn't, in principle, do something like this:
>>>>
>>>> SELECT ?s (COUNT(*) as ?count)
>>>> WHERE {
>>>>   ?s text:query "édu*" .
>>>>   ?s ?p ?o .
>>>> }
>>>> GROUP BY ?s
>>>> ORDER BY DESC(?count)
>>>> LIMIT 10
>>>>
>>>> (note: untested query)
>>>>
>>>> I'm sure it will get slow if the number of hits from the text index is
>>>> more than a few dozen. But for a small number of results at a time, it
>>>> might work.
>>>>
>>>>> As I wrote in the original post, "I'll have to implement also the
>>>>> callback for updates
>>>>> like class TextDocProducerTriples in Jena-text." .
>>>>> http://jena.apache.org/documentation/javadoc/text/org/apache/jena/query/text/TextDocProducerTriples.html
>>>>>
>>>> Isn't that called only when the indexed triple changes (e.g. the one
>>>> with rdfs:label or skos:prefLabel or whatever property you are indexing),
>>>> but not when other data related to the same subject changes? So if new
>>>> triples are added for the same subject, but its label is unchanged, then
>>>> the text index won't see the update and thus the count of
>>>> references/triples won't be updated either.
>>>>
>>>> I may be wrong here, I'm not sure how the update tracking works.
>>>>
>>>> -Osma
>>>>
>>>>
>>>>
>>>> --
>>>> Osma Suominen
>>>> D.Sc. (Tech), Information Systems Specialist
>>>> National Library of Finland
>>>> P.O. Box 26 (Kaikukatu 4)
>>>> 00014 HELSINGIN YLIOPISTO
>>>> Tel. +358 50 3199529
>>>> [email protected]
>>>> http://www.nationallibrary.fi
>>>>
>>>
>> --
>> Lorenz Bühmann
>> AKSW group, University of Leipzig
>> Group: http://aksw.org - semantic web research center
>>
>>
>
--
Lorenz Bühmann
AKSW group, University of Leipzig
Group: http://aksw.org - semantic web research center
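
P.S. To make the idea of combining a link-based rank with the text score a bit more concrete, here is a minimal sketch in plain Python. It does not use the Lucene API. It assumes a multiplicative combination of the text score with a log-dampened link count, which is only one of several possible choices. The function names, the `weight` parameter and the text scores are made up for illustration; only the link counts are taken from the result list quoted in this thread.

```python
import math


def combined_score(text_score, link_count, weight=1.0):
    """Boost a text relevance score by a log-dampened link count.

    The multiplicative form and the `weight` knob are assumptions made
    for this sketch, not what Lucene or jena-text does internally.
    """
    return text_score * (1.0 + weight * math.log1p(link_count))


def rank(hits, weight=1.0, limit=10):
    """Sort (uri, text_score, link_count) hits by combined score, best first."""
    scored = [(uri, combined_score(score, links, weight))
              for uri, score, links in hits]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Link counts come from the quoted results; text scores are hypothetical.
hits = [
    ("http://dbpedia.org/resource/Paris", 0.8, 3090),
    ("http://dbpedia.org/resource/Michel_Foucault", 1.2, 63),
    ("http://dbpedia.org/resource/Jacques_Derrida", 1.0, 64),
]

if __name__ == "__main__":
    for uri, score in rank(hits):
        print(f"{score:.3f}  {uri}")
```

With `weight=0.0` the ordering falls back to the text score alone, so the parameter controls how much the links matter. The logarithm keeps a resource with thousands of links from completely drowning out a much better text match.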
