Full text search method differences

David Habgood Sun, 01 Oct 2023 23:13:23 -0700

Hi users,

I've noticed some unexpected differences in full text search results
between two configurations, namely the two ways of configuring an analyzer
described here:
https://jena.apache.org/documentation/query/text-query.html#configuring-an-analyzer


Specifically I have tried specifying the analyzer on the
text:TextIndexLucene instance (1):

<#indexLucene> a text:TextIndexLucene ;
    text:analyzer: text:LowerCaseKeywordAnalyzer ;
    text:directory "databases/fair-ease" ;
    text:storeValues true ;
    text:entityMap <#entMap> ;
    .

<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:graphField       "graph" ;
    text:defaultField     "preflabel" ;
    text:map ( [ text:field "preflabel" ;
                   text:predicate skos:prefLabel ;
                    ]
...

And directly on the text:EntityMap (2):

<#indexLucene> a text:TextIndexLucene ;
    text:directory "databases/fair-ease" ;
    text:storeValues true ;
    text:entityMap <#entMap> ;
    .

<http://bodc.dev.kurrawong.ai/$/datasets#entMap>
        rdf:type           text:EntityMap ;
        text:defaultField  "text" ;
        text:entityField   "uri" ;
        text:map           (  [ text:analyzer   [ rdf:type text:LowerCaseKeywordAnalyzer ] ;
                               text:field      "text" ;
                               text:predicate  skos:prefLabel
                             ]
...

NB the above two examples are truncated - I'm indexing multiple fields (and
the same fields for both). Full configuration files are attached.

Method (1) gives me:
- "wildcard"-like search results, i.e. if I search for "salinity", a result
is also returned for "salinity sensor"
- scores

Method (2) gives me:
- exact-match-like search results
- scores
- adding wildcards gives results like Method (1), but all of the scores are
equal, e.g. 1.0 or 5.0

For my needs Method (1) makes more sense - I can filter the wildcard search
results using LCASE(?search) = LCASE(?match) to get exact matches and
weight these higher, and retain the scores for the wildcard matches (which
Method (2) for whatever reason returns as all equal).
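
A minimal sketch of the kind of query this implies, assuming the search term
is "salinity", that skos:prefLabel is the property being searched, and that
text:storeValues is true so the matched literal can be returned. The variable
names (?concept, ?match, ?exact) are only illustrative:

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?concept ?match ?score ?exact
WHERE {
  # Text index lookup; ?match is only bound when text:storeValues is true.
  (?concept ?score ?match) text:query (skos:prefLabel "salinity") .

  # Flag case-insensitive exact matches so they can be ranked first.
  # The search term is repeated here in lower case (assumed value).
  BIND(IF(LCASE(STR(?match)) = "salinity", 1, 0) AS ?exact)
}
ORDER BY DESC(?exact) DESC(?score)
```

Exact matches sort to the top, and the remaining results keep their relative
order by Lucene score.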

Is one of the above a misconfiguration, or is there some fundamental
difference in how the indexes are constructed, such that the results are not
expected to be the same? I've read the documentation and will take a look at
the code next, but was wondering if others have encountered something
similar.

I've only been experimenting with the LowerCaseKeywordAnalyzer so far. I'm
using Jena 4.8.0, specifically this Docker image:
ghcr.io/zazuko/fuseki-geosparql:v2.3.1

Thanks,
David

Attached configuration for Method (1):

PREFIX : <https://data.coypu.org/>
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX fuseki:    <http://jena.apache.org/fuseki#>
PREFIX rdf:       <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:      <http://www.w3.org/2000/01/rdf-schema#>
PREFIX tdb1:      <http://jena.hpl.hp.com/2008/tdb#>
PREFIX tdb2:      <http://jena.apache.org/2016/tdb#>
PREFIX text:  <http://jena.apache.org/text#>
PREFIX ja:        <http://jena.hpl.hp.com/2005/11/Assembler#>
PREFIX geosparql: <http://jena.apache.org/geosparql#>
PREFIX ex:        <http://www.example.org/resources#>
PREFIX sdo:      <https://schema.org/>
PREFIX xsd:      <http://www.w3.org/2001/XMLSchema#>
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>

# Text and Geo service 
<#serviceTest> rdf:type fuseki:Service;
    fuseki:name "fair-ease";
    fuseki:endpoint [ fuseki:operation fuseki:query ; ] ;
    fuseki:endpoint [ fuseki:operation fuseki:query ; fuseki:name "sparql" ];
    fuseki:endpoint [ fuseki:operation fuseki:query ; fuseki:name "query" ];
    fuseki:endpoint [ fuseki:operation fuseki:update ; fuseki:name "update" ];
    fuseki:endpoint [ fuseki:operation fuseki:gsp-r ; ];
    fuseki:endpoint [ fuseki:operation fuseki:gsp-r ; fuseki:name "get" ];
    fuseki:endpoint [ fuseki:operation fuseki:gsp-rw ; fuseki:name "data" ];
    fuseki:dataset <#testTextDS> .

# Text DS
<#testTextDS> rdf:type text:TextDataset ;
    text:dataset   <#testDS> ;
    text:index     <#indexLucene> ;
    .

# Text index description
<#indexLucene> a text:TextIndexLucene ;
    text:analyzer: text:LowerCaseKeywordAnalyzer ;
    text:directory "databases/fair-ease" ;
    text:storeValues true ;
    text:entityMap <#entMap> ;
    .

<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:graphField       "graph" ;
    text:defaultField     "preflabel" ;        ## Should be defined in the text:map.
    text:map (
                 [ text:field "definition" ;
                   text:predicate skos:definition ;
                    ]
                 [ text:field "preflabel" ;
                   text:predicate skos:prefLabel ;
                    ]
                 [ text:field "altlabel" ;
                   text:predicate skos:altLabel ;
                    ]
                 [ text:field "identifier" ;
                   text:predicate dcterms:identifier ;
                    ]
                 [ text:field "description" ;
                   text:predicate dcterms:description ;
                    ]
         ) .

# TDB2 dataset
<#testDS> rdf:type tdb2:DatasetTDB2 ;
    tdb2:unionDefaultGraph true ;
    tdb2:location "databases/fair-ease" ;
    .

Attached configuration for Method (2):

@prefix :          <https://data.coypu.org/> .
@prefix dcterms:   <http://purl.org/dc/terms/> .
@prefix dwc:       <http://rs.tdwg.org/dwc/terms/> .
@prefix ex:        <http://www.example.org/resources#> .
@prefix fuseki:    <http://jena.apache.org/fuseki#> .
@prefix geosparql: <http://jena.apache.org/geosparql#> .
@prefix ja:        <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix rdf:       <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:      <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sdo:       <https://schema.org/> .
@prefix skos:      <http://www.w3.org/2004/02/skos/core#> .
@prefix tdb1:      <http://jena.hpl.hp.com/2008/tdb#> .
@prefix tdb2:      <http://jena.apache.org/2016/tdb#> .
@prefix text:      <http://jena.apache.org/text#> .
@prefix xsd:       <http://www.w3.org/2001/XMLSchema#> .

<http://bodc.dev.kurrawong.ai/$/datasets#testDS>
        rdf:type                tdb2:DatasetTDB2 ;
        tdb2:location           "databases/fair-ease" ;
        tdb2:unionDefaultGraph  true .

<http://bodc.dev.kurrawong.ai/$/datasets#entMap>
        rdf:type           text:EntityMap ;
        text:defaultField  "text" ;
        text:entityField   "uri" ;
        text:map           ( [ text:analyzer   [ rdf:type  text:LowerCaseKeywordAnalyzer ] ;
                               text:field      "definition" ;
                               text:predicate  skos:definition
                             ]
                             [ text:analyzer   [ rdf:type  text:LowerCaseKeywordAnalyzer ] ;
                               text:field      "text" ;
                               text:predicate  skos:prefLabel
                             ]
                             [ text:analyzer   [ rdf:type  text:LowerCaseKeywordAnalyzer ] ;
                               text:field      "altlabel" ;
                               text:predicate  skos:altLabel
                             ]
                             [ text:analyzer   [ rdf:type  text:LowerCaseKeywordAnalyzer ] ;
                               text:field      "identifier" ;
                               text:predicate  dcterms:identifier
                             ]
                             [ text:analyzer   [ rdf:type  text:LowerCaseKeywordAnalyzer ] ;
                               text:field      "description" ;
                               text:predicate  dcterms:description
                             ]
                           ) .

<http://bodc.dev.kurrawong.ai/$/datasets#indexLucene>
        rdf:type          text:TextIndexLucene ;
        text:directory    "databases/fair-ease" ;
        text:entityMap    <http://bodc.dev.kurrawong.ai/$/datasets#entMap> ;
        text:storeValues  true .

<http://bodc.dev.kurrawong.ai/$/datasets#serviceTest>
        rdf:type         fuseki:Service ;
        fuseki:dataset   <http://bodc.dev.kurrawong.ai/$/datasets#testTextDS> ;
        fuseki:endpoint  [ fuseki:name       "sparql" ;
                           fuseki:operation  fuseki:query
                         ] ;
        fuseki:endpoint  [ fuseki:operation  fuseki:gsp-r ] ;
        fuseki:endpoint  [ fuseki:name       "update" ;
                           fuseki:operation  fuseki:update
                         ] ;
        fuseki:endpoint  [ fuseki:name       "query" ;
                           fuseki:operation  fuseki:query
                         ] ;
        fuseki:endpoint  [ fuseki:name       "data" ;
                           fuseki:operation  fuseki:gsp-rw
                         ] ;
        fuseki:endpoint  [ fuseki:name       "get" ;
                           fuseki:operation  fuseki:gsp-r
                         ] ;
        fuseki:endpoint  [ fuseki:operation  fuseki:query ] ;
        fuseki:name      "fair-ease" .

<http://bodc.dev.kurrawong.ai/$/datasets#testTextDS>
        rdf:type      text:TextDataset ;
        text:dataset  <http://bodc.dev.kurrawong.ai/$/datasets#testDS> ;
        text:index    <http://bodc.dev.kurrawong.ai/$/datasets#indexLucene> .
