Re: completion with Lucene: desirable from SPARQL

Osma Suominen Thu, 03 Nov 2016 05:35:24 -0700

Hi Jean-Marc,

I'm not sure I understand why you need to put the weights inside the Lucene index. Is it done for performance reasons?

What if the data changes? I mean, not the indexed subject itself, but for example additional triples get added to the dataset using the same subject. Surely the Lucene index will get out of date?


-Osma

03.11.2016, 13:51, Jean-Marc Vanel wrote:

Hi Osma

First I will implement the weight by counting the triples from and to each
URI being indexed in Lucene by Jena-text.
This will give users a first ordering in results, hopefully satisfying.
This is quite similar to the Google page rank, except that instead of
counting the <a href="XXX"> , it will count the triples.
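
The counting idea above can be sketched in a few lines. This is only an illustration in plain Python, not the Scala code linked below: the `triple_count_weights` and `is_uri` names, the example.org URIs and the in-memory list of triples are all assumptions, and a real implementation would iterate over the TDB dataset instead.

```python
from collections import Counter

def is_uri(term):
    # Simplification for this sketch: anything that looks like an HTTP(S) IRI is a URI.
    return isinstance(term, str) and term.startswith(("http://", "https://"))

def triple_count_weights(triples):
    """Count, for each URI, the triples leaving it (subject) or reaching it (object)."""
    weights = Counter()
    for subject, _predicate, obj in triples:
        if is_uri(subject):
            weights[subject] += 1
        if is_uri(obj):  # literal objects are not counted as incoming links
            weights[obj] += 1
    return weights

# Hypothetical sample data
EX = "http://example.org/"
triples = [
    (EX + "Paris", EX + "label", "Paris"),
    (EX + "Paris", EX + "country", EX + "France"),
    (EX + "UPMC", EX + "locatedIn", EX + "Paris"),
    (EX + "UPMC", EX + "label", "Université Pierre-et-Marie-Curie"),
]

weights = triple_count_weights(triples)
for uri, weight in weights.most_common():
    print(weight, uri)
```

Predicates are deliberately ignored here, so a property URI gets no weight from being used as a predicate.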

I sketched some code here with most of the plumbing:
https://github.com/jmvanel/semantic_forms/blob/master/scala/forms/src/main/scala/deductions/runtime/jena/lucene/TextIndexerWeight.scala

Comments welcome. It's in Scala, but it should be understandable.
Note that I have one more library dependency :
libraryDependencies += "org.apache.lucene" % "lucene-suggest" % "4.9.1"

This is code for batch primary indexing or re-indexing.
If this works well, I'll also have to implement the callback for updates,
like the class TextDocProducerTriples in Jena-text.



2016-11-01 13:59 GMT+01:00 Osma Suominen <osma.suomi...@helsinki.fi>:

Hi Jean-Marc,

The wildcard queries etc. are basic Lucene features, part of Lucene query
syntax, so probably that's why they are not documented on the jena-text page.
The query string is simply passed to the Lucene query parser by jena-text
and should support any features of Lucene, see:
http://lucene.apache.org/core/6_2_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package.description

Glad you were able to get your lookup service working!

Regarding the saving of weights: I think you could simply save them as
triples (perhaps in a separate graph), outside the Lucene index. Then
combine the results of the text:query with the weights from triples using
SPARQL.

The jena-text query also returns score values. I'm not sure how useful
they are in your use case, but they could potentially be used as a factor
in the overall "notoriety" calculation. Though if you are searching just
for single words or prefixes, chances are that the score values will be the
same for all results.
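
As a rough sketch of that combination step, here is the same join done in application code rather than in SPARQL. Everything in it is an assumption for illustration: the `rank_hits` name, the `ex:` identifiers, the sample weights, and in particular the formula (score multiplied by a log-damped weight), which is just one plausible way to use the score as a factor.

```python
import math

def rank_hits(hits, weights):
    """Order (uri, text_score) hits by score * (1 + log(1 + weight)), best first."""
    def combined(hit):
        uri, score = hit
        # URIs without a stored weight fall back to 0, i.e. the score alone.
        return score * (1.0 + math.log1p(weights.get(uri, 0)))
    return sorted(hits, key=combined, reverse=True)

# Hypothetical hits from a prefix query: all scores are equal, as can happen
# for single-word or prefix searches.
hits = [("ex:LaNorville", 1.0), ("ex:Paris", 1.0), ("ex:UPMC", 1.0)]
# Hypothetical weights, e.g. triple counts kept outside the text index.
weights = {"ex:Paris": 5000, "ex:UPMC": 300, "ex:LaNorville": 12}

ranked = rank_hits(hits, weights)
for uri, score in ranked:
    print(uri, score)
```

With identical scores the ordering is decided entirely by the weights, which is the case described above.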

Thanks for all the work on the Lucene 5 and 6 upgrade (JENA-1250)! I hope
we can finish that work and get it merged soon after the 3.1.1 release. In
any case the newer Lucene version should perform better and be easier to
maintain moving forward.

-Osma

On 01/11/16 11:01, Jean-Marc Vanel wrote:

It's too bad that the * wildcard feature, and other details of the SPARQL to
Lucene query translation, are not documented on the Jena text search page.

Anyway, it works for my use case, I now have on my laptop a (kind of)
replacement of dbPedia lookup service.

To experiment with the original dbPedia lookup service, you can go to
semantic_forms sandbox:
http://163.172.179.125:9111/create?uri=&uri=http%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2FPerson
and type a few letters in the dct:subject field.

I don't need the full original literal value, because the URI results of
the query are labelled in the application: a foaf:Person is labelled by
given and family names, etc.

BUT, there is a "but": the dbPedia lookup service results are appropriately
ordered by "notoriety".
Instead, with http://localhost:9000/lookup?q=*Pari* on my TDB that mirrors
dbPedia, I currently get:

<ArrayOfResult>
         <Result>
           <Label>Université Pierre-et-Marie-Curie</Label>
           <URI>http://dbpedia.org/resource/Pierre_and_Marie_Curie_University</URI>
         </Result><Result>
           <Label>Guillaume Le Gentil</Label>
           <URI>http://dbpedia.org/resource/Guillaume_Le_Gentil</URI>
         </Result><Result>
           <Label>1 E1 m</Label>
           <URI>http://dbpedia.org/resource/1_decametre</URI>
         </Result><Result>
           <Label>1 E4 m</Label>
           <URI>http://dbpedia.org/resource/1_myriametre</URI>
         </Result><Result>
           <Label>Nadia Boulanger</Label>
           <URI>http://dbpedia.org/resource/Nadia_Boulanger</URI>
         </Result><Result>
           <Label>Luis Mariano</Label>
           <URI>http://dbpedia.org/resource/Luis_Mariano</URI>
         </Result><Result>
           <Label>Paul Chemetov</Label>
           <URI>http://dbpedia.org/resource/Paul_Chemetov</URI>
         </Result><Result>
           <Label>Marc Boegner</Label>
           <URI>http://dbpedia.org/resource/Marc_Boegner</URI>
         </Result><Result>
           <Label>Cassandre (graphiste)</Label>
           <URI>http://dbpedia.org/resource/Cassandre_(artist)</URI>
         </Result><Result>
           <Label>La Norville</Label>
           <URI>http://dbpedia.org/resource/La_Norville</URI>
         </Result>
     </ArrayOfResult>

My understanding is that I need to set a weight on URIs in Lucene to
reflect their "notoriety".
I see 2 ways:

    1. easy to implement: just count the triples from and to the URI
    2. also take into account the URIs consulted by users in my
       application (but currently I don't record that information); there is
       also the issue of combining weights 1) and 2)
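
One possible way to combine the two weights is a simple linear blend of log-scaled counts. This is a sketch under stated assumptions, not an established formula: the `combined_weight` name, the `alpha` parameter and its default of 0.7 are hypothetical, and it presumes the application records a per-URI consultation count, which it currently does not.

```python
import math

def combined_weight(link_count, view_count, alpha=0.7):
    """Blend weight 1) (triple count) and weight 2) (user consultations).

    alpha is the share given to the triple count; both counts are log-scaled
    so that a few very large values do not dominate the ordering.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * math.log1p(link_count) + (1.0 - alpha) * math.log1p(view_count)

# Hypothetical counts for two URIs
print(combined_weight(5000, 40))
print(combined_weight(12, 0))
```

Setting `alpha=1.0` reduces this to weight 1) alone, which matches the current situation where no consultation data exists.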

Google search does both weightings.

So, in the short term I have to figure out how to add weights to the
Lucene-Jena index.

Then I have to read what dbPedia lookup does, and other background
material.



2016-10-31 16:42 GMT+01:00 Osma Suominen <osma.suomi...@helsinki.fi>:

Hi Jean-Marc,


Depending on what exactly you want from such a service, this may be
already possible with jena-text.

I'm assuming that you want to perform a prefix search such as "édu*" and
get possible completions for that, such as "éducation".

You can of course already do a prefix search with jena-text. What you will
get back will be the RDF resources which have labels that contain this
prefix. If the text index is configured to store literal values, you can
ask for the actual values as well.

E.g. with this data:

ex:cse rdfs:label "Conseil supérieur de l'éducation"@fr .

and a suitably configured jena-text index, you can perform this query:

(?s ?score ?literal) text:query (rdfs:label "édu*") .

and get back these bindings:

?s=ex:cse ?literal="Conseil supérieur de l'éducation"@fr

However, you will get the full original literal value, not just the
individual word that matched ("éducation"). If you want just the matched
word, you will need special support that jena-text doesn't currently have.
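
Until such support exists, the matched word can be recovered on the client side from the returned literal. A minimal sketch, assuming a simple word split and case-insensitive comparison; the `matching_words` name is made up, and the tokenization here will not necessarily agree with whatever Lucene analyzer the index uses.

```python
import re
import unicodedata

def matching_words(literal, prefix_query):
    """Return the words of `literal` that start with the prefix of a query like 'édu*'."""
    # Normalize so that accented letters compare equal however they were encoded.
    literal = unicodedata.normalize("NFC", literal)
    prefix = unicodedata.normalize("NFC", prefix_query.rstrip("*")).casefold()
    words = re.findall(r"\w+", literal)  # \w is Unicode-aware for str patterns
    return [word for word in words if word.casefold().startswith(prefix)]

print(matching_words("Conseil supérieur de l'éducation", "édu*"))
```

For the example data above this keeps only "éducation" out of the full label.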

-Osma

On 17/10/16 11:37, Jean-Marc Vanel wrote:

Hi


I'm implementing an equivalent of the dbPedia lookup service [1] in
semantic_forms, leveraging the Lucene integration in TDB and a dbPedia
mirror with TDB [2].

The dbPedia lookup service is really nice but:

     - the hosted service is often down
     - completion is in English only

A lookup service with TDB and Lucene would overcome these 2 problems.

So I would need completion with Lucene from SPARQL.
According to the Jena documentation, this does not seem to be implemented:
https://jena.apache.org/documentation/query/text-query.html#query-with-sparql

Searching for "lucene completion" returns plenty of pages.

Among these pages there is a code snippet here:
http://stackoverflow.com/questions/120180/how-to-do-query-auto-completion-suggestions-in-lucene
but a regular Lucene API may exist.

[1] https://github.com/dbpedia/lookup
[2] https://github.com/jmvanel/semantic_forms/blob/master/doc/en/administration.md#populating-with-dbpedia-mirroring-dbpedia

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suomi...@helsinki.fi
http://www.nationallibrary.fi


