Jena Full Text Search poor performance

Goławski , Paweł Wed, 15 Jun 2022 06:37:33 -0700

Hi,
I'm trying to use the Jena Full Text Search feature as described at
https://jena.apache.org/documentation/query/text-query.html
I've noticed that queries using "text:query" are very slow: ~20 times slower
than similar queries using a "FILTER contains" clause.
There are ~5.5M triples in the database, 18230 of them with the indexed predicate.
The database takes 1.3 GB and the index 4.2 MB of disk space.
The Fuseki server has 16 GB of memory available.


My config is quite simple; nothing special is configured:

################################################################################################

PREFIX :        <#>
PREFIX fuseki:  <http://jena.apache.org/fuseki#>
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ja:      <http://jena.hpl.hp.com/2005/11/Assembler#>
PREFIX tdb:     <http://jena.hpl.hp.com/2008/tdb#>
PREFIX tdb2:    <http://jena.apache.org/2016/tdb#>
PREFIX text:    <http://jena.apache.org/text#>
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>
PREFIX fhir:    <http://hl7.org/fhir/>
PREFIX tes:     <http://mycompany/tes/>

[] rdf:type fuseki:Server ;
   fuseki:services (
                       :service
                   ) .

:service rdf:type fuseki:Service ;
                     fuseki:name "tes" ;
                     fuseki:serviceQuery               "query" , "sparql" ;   # SPARQL query service
                     fuseki:serviceUpdate              "update" ;   # SPARQL update service
                     fuseki:serviceReadWriteGraphStore "data" ;     # SPARQL Graph store protocol (read and write)
                     fuseki:serviceReadGraphStore      "get" ;
                     fuseki:serviceUpload              "upload" ;
                     fuseki:dataset :text_dataset ;
.

# A TextDataset is a regular dataset with a text index.
:text_dataset rdf:type    text:TextDataset ;
                          text:dataset   :tdb2_dataset_readwrite;
                          text:index     :indexLucene ;
.

# A TDB dataset used for RDF storage
:tdb2_dataset_readwrite rdf:type tdb2:DatasetTDB ;
    tdb2:location  "databases/db" ;
.


:indexLucene a text:TextIndexLucene ;
     text:directory "databases/db-index" ;
     text:entityMap :entMap ;
     text:storeValues true ;
     text:analyzer [
                       a text:StandardAnalyzer ;
#                       text:stopWords ("the" "a" "an" "and" "but")
                   ] ;
#    text:queryAnalyzer [ a text:StandardAnalyzer ] ;
     text:queryParser text:QueryParser ;
# text:multilingualSupport true ; # optional
.
# Entity map (see documentation for other options)
:entMap a text:EntityMap ;
            text:defaultField     "tesValue" ;
            text:entityField      "uri" ;
            text:uidField         "uid" ;
            text:langField        "lang" ;
            text:graphField       "graph" ;
            text:map (
                         [ text:field "tesValue" ;
                           text:predicate tes:indexedValue
                         ]
                     )
.

################################################################################################



Here are two very similar SPARQL queries:

*         with a "text:query" clause:



PREFIX  tes:  <http://mycompany/tes/>
PREFIX  fhir: <http://hl7.org/fhir/>
PREFIX  rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX  owl:  <http://www.w3.org/2002/07/owl#>
PREFIX  xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX  skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX  rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX  text: <http://jena.apache.org/text#>

SELECT DISTINCT  ?this ?json
WHERE
  { ?this  rdf:type  fhir:CodeSystem .
    ?this fhir:Resource.jsonContent/fhir:value ?json .
    ?this fhir:CodeSystem.name/text:query (tes:indexedValue '*Allergy*')
  }



*         and with a "FILTER contains" clause:



PREFIX  tes:  <http://cgm.com/tes/>
PREFIX  fhir: <http://hl7.org/fhir/>
PREFIX  rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX  owl:  <http://www.w3.org/2002/07/owl#>
PREFIX  xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX  skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX  rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX  text: <http://jena.apache.org/text#>

SELECT DISTINCT  ?this ?json
WHERE
  { ?this  rdf:type  fhir:CodeSystem .
    ?this fhir:Resource.jsonContent/fhir:value ?json .
    ?this fhir:CodeSystem.name/tes:indexedValue ?name
    FILTER contains(?name, "Allergy")
  }
==========================================================================================

Log from Fuseki:



15:19:33 INFO  Fuseki          :: [4] POST http://localhost:3030/tes/sparql

15:19:33 INFO  Fuseki          :: [4] Query = PREFIX  tes:  
http://mycomany/tes/ PREFIX  fhir: http://hl7.org/fhir/ PREFIX  rdf:  
http://www.w3.org/1999/02/22-rdf-syntax-ns# PREFIX  owl:  
http://www.w3.org/2002/07/owl# PREFIX  xsd:  http://www.w3.org/2001/XMLSchema# 
PREFIX  skos: http://www.w3.org/2004/02/skos/core# PREFIX  rdfs: 
http://www.w3.org/2000/01/rdf-schema# PREFIX  text: 
http://jena.apache.org/text#  SELECT DISTINCT  ?this ?json WHERE   { ?this  
rdf:type  fhir:CodeSystem .     ?this fhir:Resource.jsonContent/fhir:value 
?json .      ?this fhir:CodeSystem.name/tes:indexedValue ?name FILTER 
contains(?name, "Allergy")   }

15:19:33 INFO  Fuseki          :: [4] 200 OK (55 ms)



15:20:25 INFO  Fuseki          :: [5] POST http://localhost:3030/tes/sparql

15:20:25 INFO  Fuseki          :: [5] Query = PREFIX  tes:  
http://mycomany/tes/ PREFIX  fhir: http://hl7.org/fhir/ PREFIX  rdf:  
http://www.w3.org/1999/02/22-rdf-syntax-ns# PREFIX  owl:  
http://www.w3.org/2002/07/owl# PREFIX  xsd:  http://www.w3.org/2001/XMLSchema# 
PREFIX  skos: http://www.w3.org/2004/02/skos/core# PREFIX  rdfs: 
http://www.w3.org/2000/01/rdf-schema# PREFIX  text: 
http://jena.apache.org/text#  SELECT DISTINCT  ?this ?json WHERE   { ?this  
rdf:type  fhir:CodeSystem .     ?this fhir:Resource.jsonContent/fhir:value 
?json .      ?this fhir:CodeSystem.name/text:query (tes:indexedValue 
'*Allergy*')   }

15:20:36 INFO  Fuseki          :: [5] 200 OK (10,888 s)
==========================================================================================
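
To compare the two runs directly, the elapsed times can be pulled out of the response lines. The following is only a minimal sketch, assuming the log format shown above; the helper name elapsed_ms is hypothetical, and reading the comma in "10,888 s" as a decimal separator is an assumption (it agrees with the 15:20:25 to 15:20:36 timestamps).

```python
import re

# Matches the elapsed-time suffix of a Fuseki response line, e.g. "200 OK (55 ms)".
ELAPSED = re.compile(r"\((\d+(?:[.,]\d+)?)\s*(ms|s)\)\s*$")


def elapsed_ms(log_line: str) -> float:
    """Return the elapsed time of a Fuseki response log line in milliseconds."""
    match = ELAPSED.search(log_line)
    if match is None:
        raise ValueError(f"no elapsed time found in: {log_line!r}")
    value, unit = match.groups()
    # Assumption: a comma inside the number is a locale decimal separator.
    number = float(value.replace(",", "."))
    return number * 1000.0 if unit == "s" else number


# The two response lines copied from the log above.
filter_line = "15:19:33 INFO  Fuseki          :: [4] 200 OK (55 ms)"
text_line = "15:20:36 INFO  Fuseki          :: [5] 200 OK (10,888 s)"

ratio = elapsed_ms(text_line) / elapsed_ms(filter_line)
print(f"text:query took {ratio:.0f}x as long as FILTER contains in this run")
```

This only covers the single pair of requests logged above; repeated runs would be needed for a stable figure.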

There is no difference between the standard and Docker installations.
I even found the bug https://issues.apache.org/jira/browse/JENA-999 regarding
performance, but it was already fixed in version 3.1.0, and I'm currently
using version 4.4.0.
Has anyone noticed the same problem?
Am I doing something wrong?
Or is some additional configuration needed?
Is there any solution to this problem?
