Re: Apache Jena Fuseki with text indexing

Andy Seaborne Sun, 22 Mar 2020 06:19:10 -0700

Just checking one point:

Did you load the data before attaching the text index?

The text index is built as data is added, so if you first load the dataset and then set up a text index, the data that was already loaded will not be indexed.


    Andy

On 21/03/2020 07:55, Lorenz Buehmann wrote:

Hi,

welcome to Semantic Web and Apache Jena.

Comments inline:

On 20.03.20 15:36, Zhenya Antić wrote:

Hello,

I am a beginner with Fuseki, knowledge graphs and SPARQL, so please forgive me
if the questions seem obvious; the learning curve for this turned out to be
quite steep.

No problem, nothing is simple in the beginning.


I am trying to get text indexing to work with my Fuseki knowledge graph.

Which DBpedia dataset did you load? I mean, which files?


For starters, I tried using a regular expression, but that didn't work:

Just a plain query like this:
SELECT DISTINCT * WHERE {
  ?s ?p ?o
}
gives 98 results such as:

1
<http://dbpedia.org/ontology/wikiPageID:9127632>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#label>
<http://dbpedia.org/resource/Biology>
2
<http://dbpedia.org/ontology/wikiPageID:9127632>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#label>
<http://dbpedia.org/resource/Biology#Branches>
3
<http://dbpedia.org/ontology/wikiPageID:9127632>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#synonym>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#branches_of_biology>
4
<http://dbpedia.org/ontology/wikiPageID:18393>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#label>
<http://dbpedia.org/resource/Life>

That can't be the correct output of this query. rdfs:label should return
literals as object (?o), or you loaded some really weird data.


But a query with a regular expression:
SELECT DISTINCT * WHERE {
  ?s ?p ?o
  FILTER regex(?o, "Biol", "i")
}


1. you should help the query engine and use rdfs:label as property

2. you should use str() function on the ?o values:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT * WHERE {
  ?s rdfs:label ?o
  FILTER regex(str(?o), "Biol", "i")
}

gives 0 results, although there are clearly results that contain "Biol".



I'll have to try your config; maybe others will spot the issue in the meantime.


I also tried setting up indexing with a .ttl file; however, the result was
"INFO 0 (0 per second) properties indexed". The .ttl file is below:

@prefix : <http://base/#> .
@prefix tdb2: <http://jena.apache.org/2016/tdb#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix text: <http://jena.apache.org/text#> .

<http://jena.apache.org/2016/tdb#DatasetTDB>
  rdfs:subClassOf ja:RDFDataset .

ja:DatasetTxnMem rdfs:subClassOf ja:RDFDataset .

tdb2:DatasetTDB2 rdfs:subClassOf ja:RDFDataset .

tdb2:GraphTDB2 rdfs:subClassOf ja:Model .

<http://jena.apache.org/2016/tdb#GraphTDB2>
  rdfs:subClassOf ja:Model .

ja:MemoryDataset rdfs:subClassOf ja:RDFDataset .

ja:RDFDatasetZero rdfs:subClassOf ja:RDFDataset .


The rdfs:subClassOf statements should not be necessary in recent versions of Fuseki.

If any are, let us know so it can be fixed.


<http://jena.apache.org/text#TextDataset>
  rdfs:subClassOf ja:RDFDataset .

:service_tdb_all a fuseki:Service ;
  rdfs:label "TDB biology" ;
  fuseki:dataset :tdb_dataset_readwrite ;
  fuseki:name "biology" ;
  fuseki:serviceQuery "query" , "" , "sparql" ;
  fuseki:serviceReadGraphStore "get" ;
  fuseki:serviceReadQuads "" ;
  fuseki:serviceReadWriteGraphStore "data" ;
  fuseki:serviceReadWriteQuads "" ;
  fuseki:serviceUpdate "" , "update" ;
  fuseki:serviceUpload "upload" .

:tdb_dataset_readwrite
  a tdb2:DatasetTDB2 ;
  tdb2:location "db" .

<http://jena.apache.org/2016/tdb#GraphTDB>
  rdfs:subClassOf ja:Model .

ja:RDFDatasetOne rdfs:subClassOf ja:RDFDataset .

ja:RDFDatasetSink rdfs:subClassOf ja:RDFDataset .

<http://jena.apache.org/2016/tdb#DatasetTDB2>
  rdfs:subClassOf ja:RDFDataset .

<#dataset> rdf:type tdb2:DatasetTDB2 ;
  tdb2:location "db" ; # path to TDB
  .

# Text index description
:text_dataset rdf:type text:TextDataset ;
  text:dataset <#dataset> ; # <-- replace `:my_dataset` with the desired URI
  text:index <#indexLucene> ;
.

<#indexLucene> a text:TextIndexLucene ;
  text:directory <file:data/luceneIndexing> ;
  text:entityMap <#entMap> ;
  .

<#entMap> a text:EntityMap ;
  text:defaultField "text" ;
  text:entityField "uri" ;
  text:map (
    # RDF label abstracts
    [ text:field "text" ;
      text:predicate <http://www.w3.org/1999/02/22-rdf-syntax-ns#label> ;
      text:analyzer [ a text:StandardAnalyzer ]
    ]
    [ text:field "text" ;
      text:predicate <http://www.w3.org/1999/02/22-rdf-syntax-ns#synonym> ;
      text:analyzer [ a text:StandardAnalyzer ]
    ]
  ) .



<#service_text_tdb> rdf:type fuseki:Service ;
  fuseki:name "ds" ;
  fuseki:serviceQuery "query" ;
  fuseki:serviceQuery "sparql" ;
  fuseki:serviceUpdate "update" ;
  fuseki:serviceUpload "upload" ;
  fuseki:serviceReadGraphStore "get" ;
  fuseki:serviceReadWriteGraphStore "data" ;
  fuseki:dataset :text_dataset ;
  .
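
For reference, once the index actually contains entries, a query against it might look like the sketch below. It rests on several assumptions: the query is sent to the text-enabled service published as "ds" above (that is, /ds/query), the objects of the label predicate are literals rather than IRIs, and the index was populated together with or after the data load. The search string "biol*" and the variable names are only illustrative.

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# Sketch only: look up subjects through the Lucene index instead of regex().
# rdf:label is the predicate mapped to the "text" field in the entity map above.
SELECT ?s ?label WHERE {
  ?s text:query (rdf:label "biol*") ;   # Lucene wildcard search on the indexed field
     rdf:label ?label .                 # fetch the matching label from the dataset
}
```

If this returns nothing while the regex query on the same predicate does return matches, that would suggest the index itself is empty rather than the query being wrong.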

Thank you so much in advance,

__________________________
Zhenya Antić, PhD
Natural Language Processing
https://www.linkedin.com/in/zhenya-antic/

Practical Linguistics Inc
http://www.practicallinguistics.com
