Re: Text search and similar

Chris Tomlinson Sun, 12 Jan 2020 11:50:29 -0800

Hi Mikael,

> On Jan 10, 2020, at 4:26 AM, Mikael Pesonen <mikael.peso...@lingsoft.fi> 
> wrote:
> 
> 
> Hi Chris,
> 
> On 09/01/2020 17.50, Chris Tomlinson wrote:
>> Hello Br,
>> 
>>> On Jan 9, 2020, at 3:34 AM, Mikael Pesonen <mikael.peso...@lingsoft.fi> 
>>> wrote:
>>> 
>>> 
>>> Hi,
>>> 
>>> I asked about these a few years ago, so maybe there are some new ideas.
>>> 
>>> 1) Is it possible to configure the text index so that it would add, for 
>>> example, all textual values (xsd:string etc.) to the index automatically? 
>>> Now every property has to be configured manually.
>> No, it is not currently possible. More detail would help: how would you see 
>> using such a feature, how would you handle the various literal datatypes 
>> (convert all to xsd:string?), and how would you then search? Currently 
>> searches are focussed on one or more properties; a recent update allows a 
>> list of properties to be searched in a single Lucene search. 
>> More detail is available at 
>> https://jena.apache.org/documentation/query/text-query.html
> In the ideal case, all values that are string literals would be indexed. 
> Queries would work as they do now: you would define the properties you are 
> querying, for example
> 
> (?concept ?score ?prefLabel) text:query (skos:prefLabel "tech*" "lang:en")
> 
> Of course I don't know how hard this would be to implement.


So you want objects of type xsd:string and rdf:langString to be indexed under 
the property/predicate that appears in the triple. That in turn means a field 
name would need to be created from the localName of the property, and for 
rdf:langString a default lang field name would need to be defined in the 
assembler file, along with whatever multi-language analyzer structure is 
needed. This is tantamount to creating the entmap for the Lucene index 
configuration on the fly.
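
To make the idea concrete, here is a rough Python sketch of deriving field names from properties as triples arrive. It is not Jena code: `local_name` and `build_field_map` are hypothetical helpers, the triple layout is an assumption made for illustration, and the local-name split is a simplification of what Jena does. It also shows one problem such a feature would have to handle, namely two vocabularies sharing a local name.

```python
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"
INDEXABLE = {XSD_STRING, RDF_LANGSTRING}


def local_name(iri):
    """Simplified local name: everything after the last '#' or '/'."""
    cut = max(iri.rfind("#"), iri.rfind("/"))
    return iri[cut + 1:]


def build_field_map(triples):
    """Return {field name: property IRI} for properties with string-like objects.

    Each triple is assumed to be (subject, predicate, (lexical_form, datatype_iri)).
    """
    field_map = {}
    for _subject, prop, (_lexical, datatype) in triples:
        if datatype not in INDEXABLE:
            continue  # skip dates, numbers and other non-string literals
        name = local_name(prop)
        existing = field_map.get(name)
        if existing is not None and existing != prop:
            # Two different properties would map to the same index field.
            raise ValueError(f"field name clash on {name!r}: {existing} vs {prop}")
        field_map[name] = prop
    return field_map


triples = [
    ("ex:c1", "http://www.w3.org/2004/02/skos/core#prefLabel",
     ("technology", RDF_LANGSTRING)),
    ("ex:c1", "http://example.org/created",
     ("2020-01-12", "http://www.w3.org/2001/XMLSchema#date")),
]
print(build_field_map(triples))
```

Only skos:prefLabel ends up in the map here, because the other object is an xsd:date. A real implementation would also need a policy for the clash case rather than simply failing.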


>> 
>>> 2) Is there planned support for searching similar resources, based on the 
>>> Lucene index?
>> I’m not aware of any such plans. More detail would be needed to evaluate 
>> feasibility, in particular how to recognize resources as similar.
>> 
>> Please note that the Jena+Lucene model is to index individual triples as 
>> Lucene documents, not entire graphs or models, which in turn leads to 
>> indexing and searching focussed on properties.
> This would be fine. At least for our needs it would be enough to find 
> similar values only, not entire resources.

I’m sorry, I still don’t know what constitutes "similar values". I’m guessing 
you’re referring to using Lucene fuzzy matches, proximity matches and the like. 
These are already supported to an extent (see Jena Full Text Search 
<https://jena.apache.org/documentation/query/text-query.html>).
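
As an illustration of what I mean by fuzzy and proximity matches, here is a small Python sketch that builds Lucene query strings and wraps them in a text:query pattern. It assumes the search string is handed to the Lucene query parser as the documentation above describes; `fuzzy`, `proximity` and `text_query` are hypothetical helper names, not part of Jena, and the `?s ?score` variables are placeholders.

```python
def fuzzy(term, max_edits=None):
    """Lucene fuzzy term: 'tech~' or, with an explicit edit distance, 'tech~1'."""
    if max_edits is None:
        return f"{term}~"
    if max_edits not in (0, 1, 2):
        raise ValueError("Lucene fuzzy queries allow at most 2 edits")
    return f"{term}~{max_edits}"


def proximity(phrase, slop):
    """Lucene proximity query: the phrase's words within `slop` positions."""
    return f'"{phrase}"~{slop}'


def text_query(prop, lucene_query):
    """Wrap a Lucene query string in a text:query pattern.

    A single-quoted SPARQL literal is used so that the double quotes of a
    phrase query do not need escaping.
    """
    escaped = lucene_query.replace("\\", "\\\\").replace("'", "\\'")
    return f"(?s ?score) text:query ({prop} '{escaped}')"


print(text_query("skos:prefLabel", fuzzy("tech")))
print(text_query("skos:prefLabel", proximity("language technology", 3)))
```

Whether matches of this kind are "similar" enough for your use case is the part I cannot judge without more detail.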

This sort of thing would not be released until Jena 3.15 at the earliest. I 
haven’t given any implementation thought to this other than what’s written here.

Regards,
Chris


>> 
>> Chris
>> 
>>> Br
>>> 
>>> -- 
>>> 
>> 
> 
> -- 
> Lingsoft - 30 years of Leading Language Management
> 
> www.lingsoft.fi
> 
> Speech Applications - Language Management - Translation - Reader's and 
> Writer's Tools - Text Tools - E-books and M-books
> 
> Mikael Pesonen
> System Engineer
> 
> e-mail: mikael.peso...@lingsoft.fi
> Tel. +358 2 279 3300
> 
> Time zone: GMT+2
> 
> Helsinki Office
> Eteläranta 10
> FI-00130 Helsinki
> FINLAND
> 
> Turku Office
> Kauppiaskatu 5 A
> FI-20100 Turku
> FINLAND
>
