Re: Re: Jena Full Text Search poor performance

Lorenz Buehmann Thu, 16 Jun 2022 00:03:55 -0700

Wouldn't it already be sufficient to move the pattern to the top of the query? I thought Jena doesn't optimize custom property functions, i.e. it won't switch the order of those?
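A minimal sketch of what that reordering might look like, reusing the prefixes and patterns from the original query quoted below. The intermediate variable ?name is introduced here to replace the property path; whether this ordering actually changes the execution plan on this dataset is untested.

```sparql
PREFIX tes:  <http://mycompany/tes/>
PREFIX fhir: <http://hl7.org/fhir/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX text: <http://jena.apache.org/text#>

SELECT DISTINCT ?this ?json
WHERE {
  # Text index lookup first, so the remaining patterns only see matching ?name values
  ?name text:query (tes:indexedValue '*Allergy*') .
  ?this fhir:CodeSystem.name ?name .
  ?this rdf:type fhir:CodeSystem .
  ?this fhir:Resource.jsonContent/fhir:value ?json .
}
```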


On 15.06.22 22:26, Øyvind Gjesdal wrote:

Hi Pawel,


I think this could be due to the text:query being evaluated late in the
query, and other statements first computing many results, before the text
query limits it down. Maybe the contains filter gets applied earlier?

Would reordering the statements, expanding the property path and/or
enclosing the statement with the text:query in curly brackets help?

SELECT DISTINCT ?this ?json WHERE {
   { ?name text:query (tes:indexedValue '*Allergy*') . }
   ?this fhir:CodeSystem.name ?name .
   ?this rdf:type fhir:CodeSystem .
   ?this fhir:Resource.jsonContent/fhir:value ?json .
}

Another approach I use for text queries is subqueries, which give smaller
batched results, but you may have to raise the default text:query Lucene
limit to walk through all results.

SELECT DISTINCT ?this ?json WHERE {
   { SELECT ?name { ?name text:query (tes:indexedValue '*Allergy*') . }
     # LIMIT N OFFSET 0
   }
   ?this fhir:CodeSystem.name ?name .
   ?this rdf:type fhir:CodeSystem .
   ?this fhir:Resource.jsonContent/fhir:value ?json .
}
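If the Lucene hit limit turns out to truncate results, text:query accepts an optional limit as the last element of its argument list. A minimal sketch, assuming the prefixes from the original message; the value 20000 is only an illustration, not a recommendation for this index.

```sparql
PREFIX tes:  <http://mycompany/tes/>
PREFIX fhir: <http://hl7.org/fhir/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX text: <http://jena.apache.org/text#>

SELECT DISTINCT ?this ?json
WHERE {
  { SELECT ?name {
      # Third list element: maximum number of Lucene hits to return (assumed value)
      ?name text:query (tes:indexedValue '*Allergy*' 20000) .
    }
  }
  ?this fhir:CodeSystem.name ?name .
  ?this rdf:type fhir:CodeSystem .
  ?this fhir:Resource.jsonContent/fhir:value ?json .
}
```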

I do use text:query on larger indexes on a similar server configuration,
without experiencing any issues, but I haven't compared results for filter
contains and text:query.

Best regards,
Øyvind

On Wed, Jun 15, 2022 at 3:37 PM Goławski, Paweł wrote:

Hi,

I’m trying to use the Jena Full Text Search feature according to
https://jena.apache.org/documentation/query/text-query.html

I’ve noticed that queries using “text:query” are very slow: ~20 times
slower than similar queries using a “FILTER contains” clause.

There are ~5.5M triples in the database, 18230 triples with the indexed predicate.

The database takes 1.3GB and the index 4.2M of disk space.

Available memory for the Fuseki server is 16GB.



My config is quite simple, there is nothing special configured:



################################################################################################
PREFIX :        <#>
PREFIX fuseki:  <http://jena.apache.org/fuseki#>
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ja:      <http://jena.hpl.hp.com/2005/11/Assembler#>
PREFIX tdb:     <http://jena.hpl.hp.com/2008/tdb#>
PREFIX tdb2:    <http://jena.apache.org/2016/tdb#>
PREFIX text:    <http://jena.apache.org/text#>
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>
PREFIX fhir:    <http://hl7.org/fhir/>
PREFIX tes:     <http://mycompany/tes/>

[] rdf:type fuseki:Server ;
    fuseki:services (
                        :service
                    ) .

:service rdf:type fuseki:Service ;
    fuseki:name                       "tes" ;
    fuseki:serviceQuery               "query" , "sparql" ;  # SPARQL query service
    fuseki:serviceUpdate              "update" ;            # SPARQL update service
    fuseki:serviceReadWriteGraphStore "data" ;              # SPARQL Graph store protocol (read and write)
    fuseki:serviceReadGraphStore      "get" ;
    fuseki:serviceUpload              "upload" ;
    fuseki:dataset                    :text_dataset ;
.


# A TextDataset is a regular dataset with a text index.
:text_dataset rdf:type text:TextDataset ;
    text:dataset   :tdb2_dataset_readwrite ;
    text:index     :indexLucene ;
.


# A TDB dataset used for RDF storage
:tdb2_dataset_readwrite rdf:type tdb2:DatasetTDB ;
    tdb2:location  "databases/db" ;
.


:indexLucene a text:TextIndexLucene ;
    text:directory "databases/db-index" ;
    text:entityMap :entMap ;
    text:storeValues true ;
    text:analyzer [
        a text:StandardAnalyzer ;
        # text:stopWords ("the" "a" "an" "and" "but")
    ] ;
    # text:queryAnalyzer [ a text:StandardAnalyzer ] ;
    text:queryParser text:QueryParser ;
    # text:multilingualSupport true ; # optional
.

# Entity map (see documentation for other options)
:entMap a text:EntityMap ;
    text:defaultField     "tesValue" ;
    text:entityField      "uri" ;
    text:uidField         "uid" ;
    text:langField        "lang" ;
    text:graphField       "graph" ;
    text:map (
        [ text:field "tesValue" ;
          text:predicate tes:indexedValue
        ]
    )
.

################################################################################################



Here are two very similar SPARQL queries:

·         with “text:query” clause:



PREFIX  tes:  <http://mycompany/tes/>
PREFIX  fhir: <http://hl7.org/fhir/>
PREFIX  rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX  owl:  <http://www.w3.org/2002/07/owl#>
PREFIX  xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX  skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX  rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX  text: <http://jena.apache.org/text#>

SELECT DISTINCT  ?this ?json
WHERE
   { ?this  rdf:type  fhir:CodeSystem .
     ?this fhir:Resource.jsonContent/fhir:value ?json .
     ?this fhir:CodeSystem.name/text:query (tes:indexedValue '*Allergy*')
   }



·         and with “FILTER contains” clause:



PREFIX  tes:  <http://cgm.com/tes/>
PREFIX  fhir: <http://hl7.org/fhir/>
PREFIX  rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX  owl:  <http://www.w3.org/2002/07/owl#>
PREFIX  xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX  skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX  rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX  text: <http://jena.apache.org/text#>

SELECT DISTINCT  ?this ?json
WHERE
   { ?this  rdf:type  fhir:CodeSystem .
     ?this fhir:Resource.jsonContent/fhir:value ?json .
     ?this fhir:CodeSystem.name/tes:indexedValue ?name FILTER contains(?name, "Allergy")
   }


==========================================================================================

Log from fuseki:



15:19:33 INFO  Fuseki          :: [4] POST http://localhost:3030/tes/sparql

15:19:33 INFO  Fuseki          :: [4] Query = PREFIX  tes:  http://mycomany/tes/ PREFIX  
fhir: http://hl7.org/fhir/ PREFIX  rdf:  http://www.w3.org/1999/02/22-rdf-syntax-ns# 
PREFIX  owl:  http://www.w3.org/2002/07/owl# PREFIX  xsd:  
http://www.w3.org/2001/XMLSchema# PREFIX  skos: http://www.w3.org/2004/02/skos/core# 
PREFIX  rdfs: http://www.w3.org/2000/01/rdf-schema# PREFIX  text: 
http://jena.apache.org/text#  SELECT DISTINCT  ?this ?json WHERE   { ?this  rdf:type  
fhir:CodeSystem .     ?this fhir:Resource.jsonContent/fhir:value ?json .      ?this 
fhir:CodeSystem.name/tes:indexedValue ?name FILTER contains(?name, "Allergy")   
}

15:19:33 INFO  Fuseki          :: [4] 200 OK (55 ms)



15:20:25 INFO  Fuseki          :: [5] POST http://localhost:3030/tes/sparql

15:20:25 INFO  Fuseki          :: [5] Query = PREFIX  tes:  
http://mycomany/tes/ PREFIX  fhir: http://hl7.org/fhir/ PREFIX  rdf:  
http://www.w3.org/1999/02/22-rdf-syntax-ns# PREFIX  owl:  
http://www.w3.org/2002/07/owl# PREFIX  xsd:  http://www.w3.org/2001/XMLSchema# 
PREFIX  skos: http://www.w3.org/2004/02/skos/core# PREFIX  rdfs: 
http://www.w3.org/2000/01/rdf-schema# PREFIX  text: 
http://jena.apache.org/text#  SELECT DISTINCT  ?this ?json WHERE   { ?this  
rdf:type  fhir:CodeSystem .     ?this fhir:Resource.jsonContent/fhir:value 
?json .      ?this fhir:CodeSystem.name/text:query (tes:indexedValue 
'*Allergy*')   }

15:20:36 INFO  Fuseki          :: [5] 200 OK (10,888 s)


==========================================================================================



There is no difference between standard and docker installations.

I even found the bug https://issues.apache.org/jira/browse/JENA-999 regarding
performance, but it was already fixed in version 3.1.0, and I’m currently
using version 4.4.0.

Has anyone noticed the same problem?

Or maybe I’m doing something wrong?

Or must I do some additional magic configuration?

Is there any solution for this problem?
