Wouldn't it already be sufficient to move the pattern to the top of the
query? I thought Jena doesn't optimize custom property functions, i.e.
it won't change their position in the query?
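As a rough sketch of what I mean (untested against your data; prefixes are copied from your mail, and the property path is split into plain triple patterns), the text:query pattern simply goes first, without an extra group:

```sparql
PREFIX tes:  <http://mycompany/tes/>
PREFIX fhir: <http://hl7.org/fhir/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX text: <http://jena.apache.org/text#>

SELECT DISTINCT ?this ?json
WHERE {
  # Text index lookup first: binds ?name to the entities whose
  # tes:indexedValue matches the Lucene query.
  ?name text:query (tes:indexedValue '*Allergy*') .
  # Then join the (hopefully few) hits against the rest of the data.
  ?this fhir:CodeSystem.name ?name .
  ?this rdf:type fhir:CodeSystem .
  ?this fhir:Resource.jsonContent/fhir:value ?json .
}
```

Whether this alone is enough depends on the assumption that the pattern order is left as written.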
On 15.06.22 22:26, Øyvind Gjesdal wrote:
Hi Pawel,
I think this could be due to the text:query being evaluated late in the
query, with other statements first computing many results before the text
query narrows them down. Maybe the contains filter gets applied earlier?
Would reordering the statements, expanding the property path and/or
enclosing the statement with the text:query in curly brackets help?
SELECT DISTINCT ?this ?json WHERE {
  { ?name text:query (tes:indexedValue '*Allergy*') . }
  ?this fhir:CodeSystem.name ?name .
  ?this rdf:type fhir:CodeSystem .
  ?this fhir:Resource.jsonContent/fhir:value ?json .
}
Another approach I use for text queries is subqueries, for smaller
batched results, but you may have to raise the default text:query Lucene
limit to walk through all results.
SELECT DISTINCT ?this ?json WHERE {
  { SELECT ?name {
      ?name text:query (tes:indexedValue '*Allergy*') .
    }
    # LIMIT N OFFSET 0
  }
  ?this fhir:CodeSystem.name ?name .
  ?this rdf:type fhir:CodeSystem .
  ?this fhir:Resource.jsonContent/fhir:value ?json .
}
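For reference, text:query accepts an optional result limit as the last element of its argument list, which is one way to raise the Lucene limit. A sketch only: the value 50000 is an arbitrary placeholder, not a recommendation, and the prefixes are taken from the original mail.

```sparql
PREFIX tes:  <http://mycompany/tes/>
PREFIX fhir: <http://hl7.org/fhir/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX text: <http://jena.apache.org/text#>

SELECT DISTINCT ?this ?json WHERE {
  { SELECT ?name {
      # Last list element: maximum number of Lucene hits to return
      # (50000 is a placeholder; pick a value that fits the data).
      ?name text:query (tes:indexedValue '*Allergy*' 50000) .
    }
  }
  ?this fhir:CodeSystem.name ?name .
  ?this rdf:type fhir:CodeSystem .
  ?this fhir:Resource.jsonContent/fhir:value ?json .
}
```

With only about 18,000 indexed triples in this dataset, the limit may not matter here, but it becomes relevant on larger indexes.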
I do use text:query on larger indexes on a similar server configuration,
without experiencing any issues, but I haven't compared results for filter
contains and text:query.
Best regards,
Øyvind
On Wed, Jun 15, 2022 at 3:37 PM Goławski, Paweł wrote:
Hi,
I’m trying to use the Jena full text search feature according to
https://jena.apache.org/documentation/query/text-query.html
I’ve noticed that queries using “text:query” are very slow: ~20 times
slower than similar ones using a “FILTER contains” clause.
There are ~5.5M triples in the database, 18230 of them with the indexed predicate.
The database takes 1.3GB and the index 4.2M of disk space.
The memory available to the Fuseki server is 16GB.
My config is quite simple, with nothing special configured:
################################################################################################
PREFIX :       <#>
PREFIX fuseki: <http://jena.apache.org/fuseki#>
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ja:     <http://jena.hpl.hp.com/2005/11/Assembler#>
PREFIX tdb:    <http://jena.hpl.hp.com/2008/tdb#>
PREFIX tdb2:   <http://jena.apache.org/2016/tdb#>
PREFIX text:   <http://jena.apache.org/text#>
PREFIX skos:   <http://www.w3.org/2004/02/skos/core#>
PREFIX fhir:   <http://hl7.org/fhir/>
PREFIX tes:    <http://mycompany/tes/>

[] rdf:type fuseki:Server ;
    fuseki:services (
        :service
    ) .

:service rdf:type fuseki:Service ;
    fuseki:name "tes" ;
    fuseki:serviceQuery "query" , "sparql" ;     # SPARQL query service
    fuseki:serviceUpdate "update" ;              # SPARQL update service
    fuseki:serviceReadWriteGraphStore "data" ;   # SPARQL Graph store protocol (read and write)
    fuseki:serviceReadGraphStore "get" ;
    fuseki:serviceUpload "upload" ;
    fuseki:dataset :text_dataset ;
    .

# A TextDataset is a regular dataset with a text index.
:text_dataset rdf:type text:TextDataset ;
    text:dataset :tdb2_dataset_readwrite ;
    text:index :indexLucene ;
    .

# A TDB dataset used for RDF storage
:tdb2_dataset_readwrite rdf:type tdb2:DatasetTDB ;
    tdb2:location "databases/db" ;
    .

:indexLucene a text:TextIndexLucene ;
    text:directory "databases/db-index" ;
    text:entityMap :entMap ;
    text:storeValues true ;
    text:analyzer [
        a text:StandardAnalyzer ;
        # text:stopWords ("the" "a" "an" "and" "but")
    ] ;
    # text:queryAnalyzer [ a text:StandardAnalyzer ] ;
    text:queryParser text:QueryParser ;
    # text:multilingualSupport true ; # optional
    .

# Entity map (see documentation for other options)
:entMap a text:EntityMap ;
    text:defaultField "tesValue" ;
    text:entityField "uri" ;
    text:uidField "uid" ;
    text:langField "lang" ;
    text:graphField "graph" ;
    text:map (
        [ text:field "tesValue" ;
          text:predicate tes:indexedValue
        ]
    )
    .
################################################################################################
I ran two very similar SPARQL queries:
· with a “text:query” clause:
PREFIX tes:  <http://mycompany/tes/>
PREFIX fhir: <http://hl7.org/fhir/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX text: <http://jena.apache.org/text#>
SELECT DISTINCT ?this ?json
WHERE
{ ?this rdf:type fhir:CodeSystem .
  ?this fhir:Resource.jsonContent/fhir:value ?json .
  ?this fhir:CodeSystem.name/text:query (tes:indexedValue '*Allergy*')
}
· and with a “FILTER contains” clause:
PREFIX tes:  <http://mycompany/tes/>
PREFIX fhir: <http://hl7.org/fhir/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX text: <http://jena.apache.org/text#>
SELECT DISTINCT ?this ?json
WHERE
{ ?this rdf:type fhir:CodeSystem .
  ?this fhir:Resource.jsonContent/fhir:value ?json .
  ?this fhir:CodeSystem.name/tes:indexedValue ?name
  FILTER contains(?name, "Allergy")
}
==========================================================================================
Log from fuseki:
15:19:33 INFO Fuseki :: [4] POST http://localhost:3030/tes/sparql
15:19:33 INFO Fuseki :: [4] Query = PREFIX tes: http://mycomany/tes/ PREFIX
fhir: http://hl7.org/fhir/ PREFIX rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
PREFIX owl: http://www.w3.org/2002/07/owl# PREFIX xsd:
http://www.w3.org/2001/XMLSchema# PREFIX skos: http://www.w3.org/2004/02/skos/core#
PREFIX rdfs: http://www.w3.org/2000/01/rdf-schema# PREFIX text:
http://jena.apache.org/text# SELECT DISTINCT ?this ?json WHERE { ?this rdf:type
fhir:CodeSystem . ?this fhir:Resource.jsonContent/fhir:value ?json . ?this
fhir:CodeSystem.name/tes:indexedValue ?name FILTER contains(?name, "Allergy")
}
15:19:33 INFO Fuseki :: [4] 200 OK (55 ms)
15:20:25 INFO Fuseki :: [5] POST http://localhost:3030/tes/sparql
15:20:25 INFO Fuseki :: [5] Query = PREFIX tes:
http://mycomany/tes/ PREFIX fhir: http://hl7.org/fhir/ PREFIX rdf:
http://www.w3.org/1999/02/22-rdf-syntax-ns# PREFIX owl:
http://www.w3.org/2002/07/owl# PREFIX xsd: http://www.w3.org/2001/XMLSchema#
PREFIX skos: http://www.w3.org/2004/02/skos/core# PREFIX rdfs:
http://www.w3.org/2000/01/rdf-schema# PREFIX text:
http://jena.apache.org/text# SELECT DISTINCT ?this ?json WHERE { ?this
rdf:type fhir:CodeSystem . ?this fhir:Resource.jsonContent/fhir:value
?json . ?this fhir:CodeSystem.name/text:query (tes:indexedValue
'*Allergy*') }
15:20:36 INFO Fuseki :: [5] 200 OK (10,888 s)
==========================================================================================
There is no difference between the standard and Docker installations.
I even found bug https://issues.apache.org/jira/browse/JENA-999 regarding
performance, but it was already fixed in version 3.1.0, and I’m currently
using version 4.4.0.
Has anyone noticed the same problem?
Or maybe I’m doing something wrong?
Or do I need some additional configuration?
Is there any solution for this problem?