Hello Jean-Marc,

I think adding something like a PageRank score would improve the
results. Lucene itself uses more or less the standard IR measure,
TF-IDF, which reflects term statistics only and not how well connected
a resource is in the graph.
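
As a rough illustration of what I mean (a minimal Python sketch, not Jena
or Lucene code): re-rank the text hits by blending the text score with a
popularity weight. The function name rerank, the alpha parameter and the
text scores of 1.0 are assumptions made up for the example; the weights
are three of the triple counts from your results below. A PageRank score
could be plugged in as the weight instead.

```python
import math


def rerank(hits, alpha=0.5):
    """Sort (uri, text_score, weight) hits by a blend of both signals.

    text_score: relevance from the text index (hypothetical values here).
    weight: a popularity signal, e.g. a triple count or a PageRank score.
    alpha: share of the final score given to the text relevance.
    """
    if not hits:
        return []
    max_weight = max(weight for _, _, weight in hits)
    # Avoid dividing by zero when every weight is 0.
    norm = math.log1p(max_weight) or 1.0

    def blended(hit):
        _, text_score, weight = hit
        # Log-damp the weight so that very popular resources do not
        # completely drown out the text relevance.
        popularity = math.log1p(weight) / norm
        return alpha * text_score + (1 - alpha) * popularity

    return sorted(hits, key=blended, reverse=True)


# Text scores are assumed equal, so the popularity weight decides the order.
hits = [
    ("dbpedia:Henri_Bergson", 1.0, 68),
    ("dbpedia:Paris", 1.0, 3090),
    ("dbpedia:London", 1.0, 2676),
]
ranked = rerank(hits)
```

With equal text scores the order is decided by the weight alone; alpha
only matters once the text index returns differentiated scores.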



Cheers,
Lorenz

> Osma,
>
> That makes sense, and the first tests are not bad.
>
> I'm surprised, though, that "par*" does not return dbpedia:Paris in the
> first 10, whereas "pari*" returns dbpedia:Paris in first position:
>
> "count"                                           "s"
> "3090"^^http://www.w3.org/2001/XMLSchema#integer  http://dbpedia.org/resource/Paris
> "2676"^^http://www.w3.org/2001/XMLSchema#integer  http://dbpedia.org/resource/London
> "72"^^http://www.w3.org/2001/XMLSchema#integer    http://dbpedia.org/resource/Émile_Durkheim
> "68"^^http://www.w3.org/2001/XMLSchema#integer    http://dbpedia.org/resource/Henri_Bergson
> "66"^^http://www.w3.org/2001/XMLSchema#integer    http://dbpedia.org/resource/20th_arrondissement_of_Paris
> "64"^^http://www.w3.org/2001/XMLSchema#integer    http://dbpedia.org/resource/Cornelius_Castoriadis
> "64"^^http://www.w3.org/2001/XMLSchema#integer    http://dbpedia.org/resource/Jacques_Derrida
> "63"^^http://www.w3.org/2001/XMLSchema#integer    http://dbpedia.org/resource/Michel_Foucault
> "62"^^http://www.w3.org/2001/XMLSchema#integer    http://dbpedia.org/resource/Louis,_Grand_Condé
> "60"^^http://www.w3.org/2001/XMLSchema#integer    http://dbpedia.org/resource/Jean-Jacques_Rousseau
>
>
> I'll add that SPARQL query to my sandbox as a replacement for the DBpedia
> Lookup service, and tell you how it goes.
> I expect that using the Lucene implementation once the weights are added
> will be more efficient, but that demands more work...
>
>
> 2016-11-03 14:30 GMT+01:00 Osma Suominen <[email protected]>:
>
>> Hi Jean-Marc!
>>
>>> AFAIK using the weights to order results is intimately linked to the text
>>> index querying.
>>> If I want the top 10 results, the search must have the weights beforehand
>>> otherwise I must get all the results to filter later.
>>> This is the reason for using AnalyzingInfixSuggester.
>>> Lucene 4_9_1
>>> https://lucene.apache.org/core/4_9_1/suggest/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.html
>>> Lucene 6_2_1
>>> https://lucene.apache.org/core/6_2_1/suggest/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.html
>>>
>>> I guess this is what you call "performance reasons".
>>>
>> I don't see why you couldn't, in principle, do something like this:
>>
>> PREFIX text: <http://jena.apache.org/text#>
>>
>> SELECT ?s (COUNT(*) AS ?count)
>> WHERE {
>>   ?s text:query "édu*" .
>>   ?s ?p ?o .
>> }
>> GROUP BY ?s
>> ORDER BY DESC(?count)
>> LIMIT 10
>>
>> (note: untested query)
>>
>> I'm sure it will get slow if the number of hits from the text index is
>> more than a few dozen. But for a small number of results at a time, it
>> might work.
>>
>>> As I wrote in the original post, "I'll have to implement also the callback
>>> for updates
>>> like class TextDocProducerTriples in Jena-text."
>>> http://jena.apache.org/documentation/javadoc/text/org/apache/jena/query/text/TextDocProducerTriples.html
>>>
>> Isn't that called only when the indexed triple changes (e.g. the one with
>> rdfs:label or skos:prefLabel or whatever property you are indexing), but
>> not when other data related to the same subject changes? So if new triples
>> are added for the same subject, but its label is unchanged, then the text
>> index won't see the update and thus the count of references/triples won't
>> be updated either.
>>
>> I may be wrong here, I'm not sure how the update tracking works.
>>
>> -Osma
>>
>>
>>
>> --
>> Osma Suominen
>> D.Sc. (Tech), Information Systems Specialist
>> National Library of Finland
>> P.O. Box 26 (Kaikukatu 4)
>> 00014 HELSINGIN YLIOPISTO
>> Tel. +358 50 3199529
>> [email protected]
>> http://www.nationallibrary.fi
>>
>
>
-- 
Lorenz Bühmann
AKSW group, University of Leipzig
Group: http://aksw.org - semantic web research center
