Good day, Lorenz! I don't know what "IR" is.
And reusing Lucene is the plan. The current code is here (as I mentioned earlier in this thread):
https://github.com/jmvanel/semantic_forms/blob/master/scala/forms/src/main/scala/deductions/runtime/jena/lucene/TextIndexerWeight.scala

I don't know how to combine TF-IDF with ranking based on links.
I'm not even sure that, in an RDF world, term frequency brings much useful information.
If you have survey articles to recommend on search in the RDF world, or on search in general, that would help.

I have enabled the link-based ranking in search on the sandbox (counting the links, in the spirit of Google's PageRank), so my FOAF profile now comes first, thanks to its many cco:expertise links:
http://163.172.179.125:9111/wordsearch?q=Jean-Marc
In good company with Jean Sablon, Jean Moulin, and Pope John Paul II.

The TDB was populated with DBpedia using these scripts:
https://github.com/jmvanel/semantic_forms/blob/master/scala/forms_play/scripts/download-dbpedia.sh
https://github.com/jmvanel/semantic_forms/blob/master/scala/forms_play/scripts/populate_with_dbpedia.sh

2016-11-04 10:05 GMT+01:00 Lorenz B. <[email protected]>:

> Hello Jean-Marc,
>
> I think adding something like a pagerank score would improve the
> results. Lucene itself just uses more or less the standard IR measure
> TF/IDF.
>
> Cheers,
> Lorenz
>
> > Osma,
> >
> > That makes sense,
> > and the first tests are not bad.
> >
> > Although I'm surprised that "par*" does not get dbpedia:Paris in the first 10;
> > but "pari*" does get dbpedia:Paris in the first position:
> >
> > "count" "s"
> > "3090"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Paris
> > "2676"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/London
> > "72"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Émile_Durkheim
> > "68"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Henri_Bergson
> > "66"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/20th_arrondissement_of_Paris
> > "64"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Cornelius_Castoriadis
> > "64"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Jacques_Derrida
> > "63"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Michel_Foucault
> > "62"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Louis,_Grand_Condé
> > "60"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Jean-Jacques_Rousseau
> >
> > I'll add that SPARQL in my sandbox as a replacement of dbpedia lookup service,
> > and tell you how it goes.
> > But I foresee that using the Lucene implementation after adding the weights
> > will be more efficient. But that demands more work...
> >
> > 2016-11-03 14:30 GMT+01:00 Osma Suominen <[email protected]>:
> >
> >> Hi Jean-Marc!
> >>
> >>> AFAIK using the weights to order results is intimately linked to the text
> >>> index querying.
> >>> If I want the top 10 results, the search must have the weights beforehand
> >>> otherwise I must get all the results to filter later.
> >>> This is the reason for using AnalyzingInfixSuggester.
> >>> Lucene 4_9_1
> >>> https://lucene.apache.org/core/4_9_1/suggest/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.html
> >>> Lucene 6_2_1
> >>> https://lucene.apache.org/core/6_2_1/suggest/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.html
> >>>
> >>> I guess this is what you call "performance reasons".
> >>>
> >> I don't see why you couldn't, in principle, do something like this:
> >>
> >> SELECT ?s (COUNT(*) as ?count)
> >> WHERE {
> >>   ?s text:query "édu*" .
> >>   ?s ?p ?o .
> >> }
> >> GROUP BY ?s
> >> ORDER BY DESC(?count)
> >> LIMIT 10
> >>
> >> (note: untested query)
> >>
> >> I'm sure it will get slow if the number of hits from the text index is
> >> more than a few dozen. But for a small number of results at a time, it
> >> might work.
> >>
> >>> As I wrote in the original post, "I'll have to implement also the callback
> >>> for updates
> >>> like class TextDocProducerTriples in Jena-text." .
> >>> http://jena.apache.org/documentation/javadoc/text/org/apache/jena/query/text/TextDocProducerTriples.html
> >>>
> >> Isn't that called only when the indexed triple changes (e.g. the one with
> >> rdfs:label or skos:prefLabel or whatever property you are indexing), but
> >> not when other data related to the same subject changes? So if new triples
> >> are added for the same subject, but its label is unchanged, then the text
> >> index won't see the update and thus the count of references/triples won't
> >> be updated either.
> >>
> >> I may be wrong here, I'm not sure how the update tracking works.
> >>
> >> -Osma
> >>
> >> --
> >> Osma Suominen
> >> D.Sc. (Tech), Information Systems Specialist
> >> National Library of Finland
> >> P.O. Box 26 (Kaikukatu 4)
> >> 00014 HELSINGIN YLIOPISTO
> >> Tel. +358 50 3199529
> >> [email protected]
> >> http://www.nationallibrary.fi
> >
> --
> Lorenz Bühmann
> AKSW group, University of Leipzig
> Group: http://aksw.org - semantic web research center

--
Jean-Marc Vanel
Profile: http://163.172.179.125:9111/display?displayuri=http%3A%2F%2Fjmvanel.free.fr%2Fjmv.rdf%23me
Déductions SARL - Consulting, services, training,
Rule-based programming, Semantic Web
+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
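
P.S. To make the question about combining a text score with link counts more concrete, here is a minimal sketch of one possible approach. It is not what TextIndexerWeight.scala does and it does not use any Lucene or Jena API. The names combined_score and rank, the link_weight parameter, and the choice of a logarithmic damping are all assumptions made for illustration. The idea is to multiply whatever relevance score the text index returns by a damped function of the number of triples attached to the subject, so that a hub like dbpedia:Paris (3090 triples in the results quoted above) is boosted without completely drowning out a better textual match.

```python
import math


def combined_score(text_score: float, link_count: int, link_weight: float = 1.0) -> float:
    """Boost a text relevance score by a damped link count.

    text_score  : score from the text index (hypothetical input, e.g. TF-IDF based)
    link_count  : number of triples attached to the subject
    link_weight : hypothetical tuning knob; 0 disables the link boost
    """
    # log1p damps large counts: 3090 links does not weigh 40x more than 72 links
    return text_score * (1.0 + link_weight * math.log1p(link_count))


def rank(hits, top_n: int = 10):
    """Sort (uri, text_score, link_count) hits by combined score, best first."""
    scored = [(uri, combined_score(ts, lc)) for uri, ts, lc in hits]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


if __name__ == "__main__":
    # Link counts come from the query results quoted above;
    # the text scores (all 1.0) are made up for the example.
    hits = [
        ("http://dbpedia.org/resource/Henri_Bergson", 1.0, 68),
        ("http://dbpedia.org/resource/Paris", 1.0, 3090),
        ("http://dbpedia.org/resource/London", 1.0, 2676),
    ]
    for uri, score in rank(hits):
        print(f"{score:.3f}  {uri}")
```

With equal text scores the order follows the link counts, as in the SPARQL COUNT query. Whether a multiplicative boost or some other combination suits RDF data better is exactly the open question, so treat the formula as a starting point for experiments only.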
