Hi Jean-Marc,

> Guten Tag Lorenz !
Good job! German is a very difficult language.
>
> I don't know what is "IR" .
IR = Information Retrieval, which is what Lucene is basically made for.
>
> And reusing Lucene is the plan.
> The current code is here (as I mentioned earlier in this thread):
> https://github.com/jmvanel/semantic_forms/blob/master/scala/forms/src/main/scala/deductions/runtime/jena/lucene/TextIndexerWeight.scala
>
> I don't know how to combine TF-IDF with ranking based on links.
> I'm not even sure that, in an RDF world, term frequency is bringing much
> useful information.
> If you have some synthesis articles to recommend on search in RDF world, or
> in general, that would help.
There has been some discussion of how to combine ranking metrics like
PageRank with the standard Lucene score, e.g. [1], [2].
I think this can be done either via boosting at indexing time or with a
user-defined sort at query time.
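
To make the "user-defined sort" idea a bit more concrete, here is a minimal
sketch in plain Python. It is not Lucene code and not taken from [1] or [2];
it only shows one possible way to fold a link count into a text score. The
function names, the log damping, the `weight` parameter and the example hits
(apart from the count of 3090 for Paris, which comes from your results) are
all my own assumptions.

```python
import math


def combined_score(text_score: float, link_count: int, weight: float = 1.0) -> float:
    """Scale a text relevance score by a log-damped popularity boost.

    log1p keeps resources with thousands of links from drowning out
    everything else, and leaves the text score unchanged when
    link_count is 0.
    """
    return text_score * (1.0 + weight * math.log1p(link_count))


def rerank(hits, weight: float = 1.0, limit: int = 10):
    """Re-sort (uri, text_score, link_count) hits by the combined score."""
    scored = [(uri, combined_score(score, links, weight)) for uri, score, links in hits]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Hypothetical hits for a prefix query; text scores are made up.
hits = [
    ("dbpedia:Paris", 1.0, 3090),
    ("ex:Parish", 1.8, 3),
    ("ex:Parity", 2.0, 0),
]

for uri, score in rerank(hits):
    print(uri, round(score, 2))
```

With `weight=0.0` this falls back to the pure text ranking, so the influence
of the link count can be tuned. Note that, like the SPARQL approach, it needs
all candidate hits before it can sort; boosting at indexing time avoids that
but requires re-indexing when the counts change.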

There has been a lot of research regarding entity ranking; among
others, you can have a look at [3].

[1] http://blog.trifork.com/2011/11/16/apache-lucene-flexiblescoring-with-indexdocvalues/
[2] http://stackoverflow.com/questions/22473498/solr-boost-score-based-on-wikipedia-pagerank-and-solr-score
[3] http://ceur-ws.org/Vol-1586/know2.pdf
>
>
> I put on the sandbox the ranking in research (counting the links à la Google
> rank), so my FOAF profile is now first, due to many cco:expertise links :
> http://163.172.179.125:9111/wordsearch?q=Jean-Marc
> In good Company with Jean Sablon, Jean Moulin, and pope JP 2.
>
> The TDB was populated with dbpedia with these scripts :
> https://github.com/jmvanel/semantic_forms/blob/master/scala/forms_play/scripts/download-dbpedia.sh
> https://github.com/jmvanel/semantic_forms/blob/master/scala/forms_play/scripts/populate_with_dbpedia.sh
>
>
> 2016-11-04 10:05 GMT+01:00 Lorenz B. <[email protected]>:
>
>> Hello Jean-Marc,
>>
>> I think adding something like a pagerank score would improve the
>> results. Lucene itself just uses more or less the standard IR measure
>> TF/IDF.
>>
>>
>>
>> Cheers,
>> Lorenz
>>
>>> Osma,
>>>
>>> That makes sense,
>>> and the first tests are not bad.
>>>
>>> Although I'm surprised that "par*" does not get dbpedia:Paris in the first
>>> 10;
>>> but "pari*" does get dbpedia:Paris in the first position:
>>>
>>> "count" "s"
>>> "3090"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Paris
>>> "2676"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/London
>>> "72"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Émile_Durkheim
>>> "68"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Henri_Bergson
>>> "66"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/20th_arrondissement_of_Paris
>>> "64"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Cornelius_Castoriadis
>>> "64"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Jacques_Derrida
>>> "63"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Michel_Foucault
>>> "62"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Louis,_Grand_Condé
>>> "60"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Jean-Jacques_Rousseau
>>>
>>>
>>> I'll add that SPARQL in my sandbox as a replacement of dbpedia lookup
>>> service,
>>> and tell you how it goes.
>>> But I foresee that using the Lucene implementation after adding the weights
>>> will be more efficient. But that demands more work...
>>>
>>>
>>> 2016-11-03 14:30 GMT+01:00 Osma Suominen <[email protected]>:
>>>
>>>> Hi Jean-Marc!
>>>>
>>>>> AFAIK using the weights to order results is intimately linked to the text
>>>>> index querying.
>>>>> If I want the top 10 results, the search must have the weights beforehand
>>>>> otherwise I must get all the results to filter later.
>>>>> This is the reason for using AnalyzingInfixSuggester.
>>>>> Lucene 4_9_1
>>>>> https://lucene.apache.org/core/4_9_1/suggest/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.html
>>>>> Lucene 6_2_1
>>>>> https://lucene.apache.org/core/6_2_1/suggest/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.html
>>>>>
>>>>> I guess this is what you call "performance reasons" .
>>>>>
>>>> I don't see why you couldn't, in principle, do something like this:
>>>>
>>>> SELECT ?s (COUNT(*) as ?count)
>>>> WHERE {
>>>>   ?s text:query "édu*" .
>>>>   ?s ?p ?o .
>>>> }
>>>> GROUP BY ?s
>>>> ORDER BY DESC(?count)
>>>> LIMIT 10
>>>>
>>>> (note: untested query)
>>>>
>>>> I'm sure it will get slow if the number of hits from the text index is
>>>> more than a few dozen. But for a small number of results at a time, it
>>>> might work.
>>>>
>>>>> As I wrote in the original post, "I'll have to implement also the callback
>>>>> for updates
>>>>> like class TextDocProducerTriples in Jena-text." .
>>>>> http://jena.apache.org/documentation/javadoc/text/org/apache/jena/query/text/TextDocProducerTriples.html
>>>>>
>>>> Isn't that called only when the indexed triple changes (e.g. the one with
>>>> rdfs:label or skos:prefLabel or whatever property you are indexing), but
>>>> not when other data related to the same subject changes? So if new triples
>>>> are added for the same subject, but its label is unchanged, then the text
>>>> index won't see the update and thus the count of references/triples won't
>>>> be updated either.
>>>>
>>>> I may be wrong here, I'm not sure how the update tracking works.
>>>>
>>>> -Osma
>>>>
>>>>
>>>>
>>>> --
>>>> Osma Suominen
>>>> D.Sc. (Tech), Information Systems Specialist
>>>> National Library of Finland
>>>> P.O. Box 26 (Kaikukatu 4)
>>>> 00014 HELSINGIN YLIOPISTO
>>>> Tel. +358 50 3199529
>>>> [email protected]
>>>> http://www.nationallibrary.fi
>>>>
>>>
>> --
>> Lorenz Bühmann
>> AKSW group, University of Leipzig
>> Group: http://aksw.org - semantic web research center
>>
>>
>
-- 
Lorenz Bühmann
AKSW group, University of Leipzig
Group: http://aksw.org - semantic web research center
