Hello Jean-Marc,

I think adding something like a PageRank score would improve the results. Lucene itself just uses more or less the standard IR measure TF/IDF.
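As a rough illustration of the idea (this is not Lucene's actual scoring code), the sketch below computes a textbook TF-IDF score and then multiplies it by a popularity weight. The function names, the toy documents, and the use of log(1 + triple count) as a stand-in for a PageRank-like score are all assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, docs):
    """Score each document against the query with a textbook TF-IDF.

    docs maps a document id to its text. This is a simplified formula,
    not the one Lucene implements.
    """
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if counts[term] == 0:
                continue
            tf = counts[term] / len(tokens)
            idf = math.log(n_docs / df[term]) + 1.0
            score += tf * idf
        scores[doc_id] = score
    return scores


def rerank(scores, popularity):
    """Order matching documents by text score times log(1 + popularity).

    popularity is a hypothetical per-document weight, e.g. a triple count
    or a precomputed PageRank-like value.
    """
    boosted = {
        doc_id: score * math.log1p(popularity.get(doc_id, 0))
        for doc_id, score in scores.items()
        if score > 0
    }
    return sorted(boosted, key=boosted.get, reverse=True)


# Toy data: the shorter label loses on plain TF-IDF but wins once weighted.
docs = {"Paris": "paris france", "Paris_Texas": "paris paris texas"}
text_scores = tf_idf_scores(["paris"], docs)
print(rerank(text_scores, {"Paris": 3090, "Paris_Texas": 10}))
```

With text scores alone the second document ranks first because the term occurs more often in it; the weight reverses that order.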
Cheers,
Lorenz

> Osma,
>
> That makes sense,
> and the first tests are not bad.
>
> Although I'm surprised that "par*" does not get dbpedia:Paris in the first 10;
> but "pari*" does get dbpedia:Paris in the first position:
>
> "count" "s"
> "3090"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Paris
> "2676"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/London
> "72"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Émile_Durkheim
> "68"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Henri_Bergson
> "66"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/20th_arrondissement_of_Paris
> "64"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Cornelius_Castoriadis
> "64"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Jacques_Derrida
> "63"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Michel_Foucault
> "62"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Louis,_Grand_Condé
> "60"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Jean-Jacques_Rousseau
>
> I'll add that SPARQL in my sandbox as a replacement of dbpedia lookup service,
> and tell you how it goes.
> But I foresee that using the Lucene implementation after adding the weights
> will be more efficient. But that demands more work...
>
> 2016-11-03 14:30 GMT+01:00 Osma Suominen <[email protected]>:
>
>> Hi Jean-Marc!
>>
>>> AFAIK using the weights to order results is intimately linked to the text
>>> index querying.
>>> If I want the top 10 results, the search must have the weights beforehand
>>> otherwise I must get all the results to filter later.
>>> This is the reason for using AnalyzingInfixSuggester.
>>> Lucene 4_9_1
>>> https://lucene.apache.org/core/4_9_1/suggest/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.html
>>> Lucene 6_2_1
>>> https://lucene.apache.org/core/6_2_1/suggest/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.html
>>>
>>> I guess this is what you call "performance reasons" .
>>>
>> I don't see why you couldn't, in principle, do something like this:
>>
>> SELECT ?s (COUNT(*) as ?count)
>> WHERE {
>>   ?s text:query "édu*" .
>>   ?s ?p ?o .
>> }
>> GROUP BY ?s
>> ORDER BY DESC(?count)
>> LIMIT 10
>>
>> (note: untested query)
>>
>> I'm sure it will get slow if the number of hits from the text index is
>> more than a few dozen. But for a small number of results at a time, it
>> might work.
>>
>>> As I wrote in the original post, "I'll have to implement also the callback
>>> for updates
>>> like class TextDocProducerTriples in Jena-text." .
>>> http://jena.apache.org/documentation/javadoc/text/org/apache/jena/query/text/TextDocProducerTriples.html
>>>
>> Isn't that called only when the indexed triple changes (e.g. the one with
>> rdfs:label or skos:prefLabel or whatever property you are indexing), but
>> not when other data related to the same subject changes? So if new triples
>> are added for the same subject, but its label is unchanged, then the text
>> index won't see the update and thus the count of references/triples won't
>> be updated either.
>>
>> I may be wrong here, I'm not sure how the update tracking works.
>>
>> -Osma
>>
>> --
>> Osma Suominen
>> D.Sc. (Tech), Information Systems Specialist
>> National Library of Finland
>> P.O. Box 26 (Kaikukatu 4)
>> 00014 HELSINGIN YLIOPISTO
>> Tel. +358 50 3199529
>> [email protected]
>> http://www.nationallibrary.fi
>>

--
Lorenz Bühmann
AKSW group, University of Leipzig
Group: http://aksw.org - semantic web research center
