Hi users,
I've noticed some unexpected differences in full-text search results
between two different configurations. These are essentially the two
analyzer configuration options described here:
https://jena.apache.org/documentation/query/text-query.html#configuring-an-analyzer
Specifically, I have tried specifying the analyzer on the
text:TextIndexLucene instance (1):
<#indexLucene> a text:TextIndexLucene ;
    text:analyzer: text:LowerCaseKeywordAnalyzer ;
    text:directory "databases/fair-ease" ;
    text:storeValues true ;
    text:entityMap <#entMap> ;
    .
<#entMap> a text:EntityMap ;
    text:entityField "uri" ;
    text:graphField "graph" ;
    text:defaultField "preflabel" ;
    text:map ( [ text:field "preflabel" ;
                 text:predicate skos:prefLabel ;
               ]
    ...
And specifying it directly on the text:EntityMap (2):
<#indexLucene> a text:TextIndexLucene ;
    text:directory "databases/fair-ease" ;
    text:storeValues true ;
    text:entityMap <#entMap> ;
    .
<http://bodc.dev.kurrawong.ai/$/datasets#entMap>
    rdf:type text:EntityMap ;
    text:defaultField "text" ;
    text:entityField "uri" ;
    text:map ( [ text:analyzer [ rdf:type text:LowerCaseKeywordAnalyzer ] ;
                 text:field "text" ;
                 text:predicate skos:prefLabel
               ]
    ...
NB: the two examples above are truncated. I'm indexing multiple fields (the
same fields in both cases). The full configuration files are attached.
Method (1) gives me:
- a set of "wildcard"-like search results, i.e. if I search for "salinity",
a result will also be returned for "salinity sensor"
- scores
Method (2) gives me:
- a set of exact-match-like search results
- scores
- adding wildcards gives results like Method (1), but all of the scores are
equal, e.g. 1.0 or 5.0
For my needs, Method (1) makes more sense: I can filter the wildcard search
results using LCASE(?search) = LCASE(?match) to get exact matches and
weight these higher, while retaining the scores for the wildcard matches
(which Method (2), for whatever reason, returns as all equal).
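A rough sketch of that exact-match weighting in SPARQL follows. The search
term "salinity", the variable names and the 1/0 ?exact flag are illustrative
assumptions rather than the actual queries in use. STR() is applied to the
matched literal on the assumption that labels may carry language tags, which
would otherwise stop a plain string comparison from matching.

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?concept ?match ?score ?exact
WHERE {
  # Full-text lookup on skos:prefLabel; ?match is only bound because
  # text:storeValues is true in the index configuration.
  (?concept ?score ?match) text:query (skos:prefLabel "salinity") .

  # Flag case-insensitive exact matches (hypothetical 1/0 ranking flag).
  BIND(IF(LCASE(STR(?match)) = LCASE("salinity"), 1, 0) AS ?exact)
}
# Exact matches first, then the remaining hits by Lucene score.
ORDER BY DESC(?exact) DESC(?score)
```

Sorting on the flag first keeps the Lucene scores untouched, so the
remaining "wildcard"-like hits are still ranked relative to each other.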
Is one of the above a misconfiguration, or is there some fundamental
difference in how the indexes are constructed, such that the results are
not expected to be the same? I've read the documentation and will take a
look at the code next, but was wondering whether others have encountered
something similar.
I've only been experimenting with the LowerCaseKeywordAnalyzer so far.
I'm using Jena 4.8.0, specifically this Docker image:
ghcr.io/zazuko/fuseki-geosparql:v2.3.1
Thanks,
David
PREFIX : <https://data.coypu.org/>
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX fuseki: <http://jena.apache.org/fuseki#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX tdb1: <http://jena.hpl.hp.com/2008/tdb#>
PREFIX tdb2: <http://jena.apache.org/2016/tdb#>
PREFIX text: <http://jena.apache.org/text#>
PREFIX ja: <http://jena.hpl.hp.com/2005/11/Assembler#>
PREFIX geosparql: <http://jena.apache.org/geosparql#>
PREFIX ex: <http://www.example.org/resources#>
PREFIX sdo: <https://schema.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
# Text and Geo service
<#serviceTest> rdf:type fuseki:Service;
    fuseki:name "fair-ease";
    fuseki:endpoint [ fuseki:operation fuseki:query ; ] ;
    fuseki:endpoint [ fuseki:operation fuseki:query ; fuseki:name "sparql" ];
    fuseki:endpoint [ fuseki:operation fuseki:query ; fuseki:name "query" ];
    fuseki:endpoint [ fuseki:operation fuseki:update ; fuseki:name "update" ];
    fuseki:endpoint [ fuseki:operation fuseki:gsp-r ; ];
    fuseki:endpoint [ fuseki:operation fuseki:gsp-r ; fuseki:name "get" ];
    fuseki:endpoint [ fuseki:operation fuseki:gsp-rw ; fuseki:name "data" ];
    fuseki:dataset <#testTextDS> .

# Text DS
<#testTextDS> rdf:type text:TextDataset ;
    text:dataset <#testDS> ;
    text:index <#indexLucene> ;
    .

# Text index description
<#indexLucene> a text:TextIndexLucene ;
    text:analyzer: text:LowerCaseKeywordAnalyzer ;
    text:directory "databases/fair-ease" ;
    text:storeValues true ;
    text:entityMap <#entMap> ;
    .

<#entMap> a text:EntityMap ;
    text:entityField "uri" ;
    text:graphField "graph" ;
    text:defaultField "preflabel" ; ## Should be defined in the text:map.
    text:map (
        [ text:field "definition" ;
          text:predicate skos:definition ;
        ]
        [ text:field "preflabel" ;
          text:predicate skos:prefLabel ;
        ]
        [ text:field "altlabel" ;
          text:predicate skos:altLabel ;
        ]
        [ text:field "identifier" ;
          text:predicate dcterms:identifier ;
        ]
        [ text:field "description" ;
          text:predicate dcterms:description ;
        ]
    ) .

# TDB2 dataset
<#testDS> rdf:type tdb2:DatasetTDB2 ;
    tdb2:unionDefaultGraph true ;
    tdb2:location "databases/fair-ease" ;
    .
@prefix : <https://data.coypu.org/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
@prefix ex: <http://www.example.org/resources#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix geosparql: <http://jena.apache.org/geosparql#> .
@prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sdo: <https://schema.org/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix tdb1: <http://jena.hpl.hp.com/2008/tdb#> .
@prefix tdb2: <http://jena.apache.org/2016/tdb#> .
@prefix text: <http://jena.apache.org/text#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://bodc.dev.kurrawong.ai/$/datasets#testDS>
    rdf:type tdb2:DatasetTDB2 ;
    tdb2:location "databases/fair-ease" ;
    tdb2:unionDefaultGraph true .

<http://bodc.dev.kurrawong.ai/$/datasets#entMap>
    rdf:type text:EntityMap ;
    text:defaultField "text" ;
    text:entityField "uri" ;
    text:map ( [ text:analyzer [ rdf:type text:LowerCaseKeywordAnalyzer ] ;
                 text:field "definition" ;
                 text:predicate skos:definition
               ]
               [ text:analyzer [ rdf:type text:LowerCaseKeywordAnalyzer ] ;
                 text:field "text" ;
                 text:predicate skos:prefLabel
               ]
               [ text:analyzer [ rdf:type text:LowerCaseKeywordAnalyzer ] ;
                 text:field "altlabel" ;
                 text:predicate skos:altLabel
               ]
               [ text:analyzer [ rdf:type text:LowerCaseKeywordAnalyzer ] ;
                 text:field "identifier" ;
                 text:predicate dcterms:identifier
               ]
               [ text:analyzer [ rdf:type text:LowerCaseKeywordAnalyzer ] ;
                 text:field "description" ;
                 text:predicate dcterms:description
               ]
             ) .

<http://bodc.dev.kurrawong.ai/$/datasets#indexLucene>
    rdf:type text:TextIndexLucene ;
    text:directory "databases/fair-ease" ;
    text:entityMap <http://bodc.dev.kurrawong.ai/$/datasets#entMap> ;
    text:storeValues true .

<http://bodc.dev.kurrawong.ai/$/datasets#serviceTest>
    rdf:type fuseki:Service ;
    fuseki:dataset <http://bodc.dev.kurrawong.ai/$/datasets#testTextDS> ;
    fuseki:endpoint [ fuseki:name "sparql" ;
                      fuseki:operation fuseki:query
                    ] ;
    fuseki:endpoint [ fuseki:operation fuseki:gsp-r ] ;
    fuseki:endpoint [ fuseki:name "update" ;
                      fuseki:operation fuseki:update
                    ] ;
    fuseki:endpoint [ fuseki:name "query" ;
                      fuseki:operation fuseki:query
                    ] ;
    fuseki:endpoint [ fuseki:name "data" ;
                      fuseki:operation fuseki:gsp-rw
                    ] ;
    fuseki:endpoint [ fuseki:name "get" ;
                      fuseki:operation fuseki:gsp-r
                    ] ;
    fuseki:endpoint [ fuseki:operation fuseki:query ] ;
    fuseki:name "fair-ease" .

<http://bodc.dev.kurrawong.ai/$/datasets#testTextDS>
    rdf:type text:TextDataset ;
    text:dataset <http://bodc.dev.kurrawong.ai/$/datasets#testDS> ;
    text:index <http://bodc.dev.kurrawong.ai/$/datasets#indexLucene> .