Re: Text search and similar

Mikael Pesonen Mon, 13 Jan 2020 01:52:40 -0800



On 12/01/2020 21.50, Chris Tomlinson wrote:

Hi Mikael,

On Jan 10, 2020, at 4:26 AM, Mikael Pesonen <[email protected]> wrote:


Hi Chris,

On 09/01/2020 17.50, Chris Tomlinson wrote:

Hello Br,

On Jan 9, 2020, at 3:34 AM, Mikael Pesonen <[email protected]> wrote:


Hi,

I asked about these a few years ago, so maybe there are some new ideas by now.

1) Is it possible to configure the text index so that it would automatically 
add, for example, all textual values (xsd:string etc.) to the index? Currently 
every property has to be configured manually.

No, it is not currently possible. More detail would help: how would you see 
using such a feature, how would you handle the various literal datatypes 
(convert all to xsd:string?), and how would you search? Currently searches are 
focussed on one or more properties; a recent update allows you to provide a 
list of properties that can be searched in a single Lucene search. More detail 
is available at 
<https://jena.apache.org/documentation/query/text-query.html>.

In the ideal case, all values that are string literals would be indexed. 
Queries would work as they do now: you would define the properties you are 
querying, for example

(?concept ?score ?prefLabel) text:query (skos:prefLabel "tech*" "lang:en")

Of course, I don't know how hard this would be to implement.

So, you're wanting objects of type xsd:string and rdf:langString to be indexed 
with the property/predicate appearing in the triple. This in turn would mean 
that a field name would need to be created based on the resource localName of 
the property and for rdf:langString a default lang field name would need to be 
defined in the assembler file along with whatever multi-language analyzer 
structure is needed. This is tantamount to creating the entmap for the Lucene 
index configuration on-the-fly.
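
To make the idea of building the entity map on the fly concrete, here is a minimal sketch in plain Python. It is not Jena code and does not use any Jena API: the Triple type, local_name and build_entity_map are hypothetical names invented for illustration, and the local-name rule (everything after the last '#' or '/') is a simplifying assumption.

```python
from typing import NamedTuple

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


class Triple(NamedTuple):
    """Hypothetical, simplified triple whose object is always a literal."""
    subject: str
    predicate: str
    lexical: str   # lexical form of the object literal
    datatype: str  # datatype IRI of the object literal


def local_name(iri: str) -> str:
    """Return the part of the IRI after the last '#' or '/' (an assumption)."""
    cut = max(iri.rfind("#"), iri.rfind("/"))
    return iri[cut + 1:]


def build_entity_map(triples):
    """Map each predicate that has a string-typed object to a field name."""
    fields = {}
    for t in triples:
        if t.datatype in (XSD_STRING, RDF_LANGSTRING):
            # First sighting of a predicate fixes its field name.
            fields.setdefault(t.predicate, local_name(t.predicate))
    return fields


triples = [
    Triple("http://example.org/c1",
           "http://www.w3.org/2004/02/skos/core#prefLabel",
           "technology", RDF_LANGSTRING),
    Triple("http://example.org/c1",
           "http://purl.org/dc/terms/created",
           "2020-01-09", "http://www.w3.org/2001/XMLSchema#date"),
    Triple("http://example.org/c1",
           "http://purl.org/dc/terms/title",
           "Tech report", XSD_STRING),
]

entity_map = build_entity_map(triples)
print(entity_map)
```

Only the two string-typed predicates end up in the map; the xsd:date value is skipped. One thing the sketch makes visible is that two properties from different namespaces with the same local name would get the same field name, so a real implementation would need some rule for that case.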

I'm not quite sure what resource localName and entmap mean, but this would be ideal, yes.

The reason for this is that we provide our customers with a file/metadata service, so we don't know what metadata will be entered. For that reason we are currently using an external Lucene index, which is a bit of a hassle.

2) Is there planned support for searching similar resources, based on the 
Lucene index?

I’m not aware of any such plans. More detail would be needed to evaluate 
feasibility, in particular how to recognize resources as similar.

Please note that the Jena+Lucene model is to index individual triples as Lucene 
documents, not entire graphs or models, which in turn leads to indexing and 
searching focussed on properties.

This would be fine. At least for our needs it would be enough to find similar 
values only, not entire resources.

I’m sorry, I still don’t know what constitutes "similar values". I’m guessing 
you’re referring to using Lucene fuzzy matches, proximity matches and the like. 
These are already supported to an extent (see Jena Full Text Search 
<https://jena.apache.org/documentation/query/text-query.html>).
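
As a rough illustration of what fuzzy and proximity matches look like, the sketch below builds query strings in the Lucene classic query syntax (term~N for fuzzy, "phrase"~N for proximity) and places them in a text:query pattern of the same shape as the example earlier in this thread. The helper names fuzzy, proximity and text_query_pattern are hypothetical, and the variable names ?s, ?score and ?literal are arbitrary; whether a given query behaves as wanted depends on the analyzer configured for the index.

```python
def fuzzy(term: str, max_edits: int = 2) -> str:
    """Lucene fuzzy term query, e.g. technology~1 (edit distance 0 to 2)."""
    if max_edits not in (0, 1, 2):
        raise ValueError("Lucene fuzzy queries accept an edit distance of 0, 1 or 2")
    return f"{term}~{max_edits}"


def proximity(phrase: str, slop: int) -> str:
    """Lucene proximity query: the words of the phrase within `slop` positions."""
    if slop < 0:
        raise ValueError("slop must be non-negative")
    return f'"{phrase}"~{slop}'


def text_query_pattern(prop: str, lucene_query: str) -> str:
    """Wrap a Lucene query string in a text:query pattern for one property."""
    # Escape backslashes and double quotes for a SPARQL string literal.
    escaped = lucene_query.replace("\\", "\\\\").replace('"', '\\"')
    return f'(?s ?score ?literal) text:query ({prop} "{escaped}")'


print(text_query_pattern("skos:prefLabel", fuzzy("tecnology", 1)))
print(text_query_pattern("skos:prefLabel", proximity("language technology", 3)))
```

The first pattern would match labels within one edit of the misspelled "tecnology"; the second would match labels where "language" and "technology" occur within three positions of each other.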

This sort of thing would not be released until Jena 3.15 at the earliest. I 
haven’t given any implementation thought to this other than what’s written here.

Regards,
Chris

Chris

Br

--
Lingsoft - 30 years of Leading Language Management

www.lingsoft.fi

Speech Applications - Language Management - Translation - Reader's and Writer's 
Tools - Text Tools - E-books and M-books

Mikael Pesonen
System Engineer

e-mail: [email protected]
Tel. +358 2 279 3300

Time zone: GMT+2

Helsinki Office
Eteläranta 10
FI-00130 Helsinki
FINLAND

Turku Office
Kauppiaskatu 5 A
FI-20100 Turku
FINLAND


