Re: Apache Jena Fuseki with text indexing

Andy Seaborne Wed, 01 Apr 2020 04:04:39 -0700



On 26/03/2020 18:26, Zhenya Antić wrote:

Andy,

I think I figured out what the issue is. It seems that I have two datasets with the same 
name, and one was started with the config file I sent (and has no data in it - and hence 
it is not indexed), and the other was started without a config file (like this: 
fuseki-server --port 3030 --loc="db" /biology), and it has the data.

How do I transfer the data from one to the other?


The safest way is to reload.

You can copy files around when the server is not running (for both TDB and Lucene) but obviously that's error prone and only works if the target is empty.


    Andy


Thanks,
Zhenya


On Thu, Mar 26, 2020, at 12:22 PM, Chris Tomlinson wrote:

Zhenya,

Do you see any content in the directory:

text:directory <file:data/luceneIndexing> ;


like the following partial listing:

fuseki@foo :~/base/lucene-test$ ls -l
total 3608108
-rw-rw---- 1 fuseki fuseki 7772 Jan 29 21:15 _19a_5x.liv
-rw-r----- 1 fuseki fuseki 299 Jan 21 15:53 _19a.cfe
-rw-r----- 1 fuseki fuseki 36547721 Jan 21 15:53 _19a.cfs
-rw-r----- 1 fuseki fuseki 443 Jan 21 15:53 _19a.si
-rw-r----- 1 fuseki fuseki 23621 Jan 21 15:53 _24_17n.liv
-rw-r----- 1 fuseki fuseki 22718569 Jan 21 15:53 _24.fdt
-rw-r----- 1 fuseki fuseki 9184 Jan 21 15:53 _24.fdx
-rw-r----- 1 fuseki fuseki 12975 Jan 21 15:53 _24.fnm
-rw-r----- 1 fuseki fuseki 7009762 Jan 21 15:53 _24_Lucene50_0.doc
-rw-r----- 1 fuseki fuseki 3804794 Jan 21 15:53 _24_Lucene50_0.pos
-rw-r----- 1 fuseki fuseki 16186474 Jan 21 15:53 _24_Lucene50_0.tim
-rw-r----- 1 fuseki fuseki 103945 Jan 21 15:53 _24_Lucene50_0.tip
-rw-r----- 1 fuseki fuseki 667296 Jan 21 15:53 _24.nvd
-rw-r----- 1 fuseki fuseki 4027 Jan 21 15:53 _24.nvm
-rw-r----- 1 fuseki fuseki 540 Jan 21 15:53 _24.si
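
The same check can be scripted. This is a minimal sketch, assuming that files with the extensions seen in the listing above indicate that something has been indexed; that is a heuristic, not a guarantee made by Lucene. The helper name segment_files and the temporary demo directory are hypothetical.

```python
import os
import tempfile

# Extensions taken from the listing above. Treating them as evidence of
# indexed documents is a heuristic assumption, not a Lucene guarantee.
SEGMENT_EXTENSIONS = {".cfs", ".cfe", ".si", ".fdt", ".fdx", ".fnm"}


def segment_files(index_dir):
    """Return the sorted names of files in index_dir that look like
    Lucene segment files, or an empty list if the directory is missing."""
    if not os.path.isdir(index_dir):
        return []
    return sorted(
        name
        for name in os.listdir(index_dir)
        if os.path.splitext(name)[1] in SEGMENT_EXTENSIONS
    )


# Demo on a throwaway directory so the sketch runs without a real index.
with tempfile.TemporaryDirectory() as demo_dir:
    empty_result = segment_files(demo_dir)
    open(os.path.join(demo_dir, "_19a.cfs"), "w").close()
    populated_result = segment_files(demo_dir)

print(empty_result, populated_result)
```

Pointing segment_files at the directory named in text:directory (resolved against the directory the server was started from, if the path is relative) gives a quick hint as to whether anything was written to the index.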


Also if you don’t have text:storeValues set to true then queries like:

  (?s ?score ?lit) text:query "ribosome"

won’t bind anything to ?lit. The storeValues option is set like this:

# Text index description
:test_lucene_index a text:TextIndexLucene ;
text:directory <file:/usr/local/fuseki/base/lucene-test> ;
text:storeValues true ;
text:entityMap :test_entmap ;
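
To try the text query from a script, something like the following could be used. It is only a sketch: the host localhost and the endpoint path /biology/query are assumptions (the port and the dataset name come from elsewhere in this thread), and the helper names are hypothetical. It builds the request URL but does not contact a server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# The text: prefix is the namespace used in the configuration in this thread.
TEXT_QUERY_TEMPLATE = """PREFIX text: <http://jena.apache.org/text#>
SELECT ?s ?score ?lit WHERE {{
  (?s ?score ?lit) text:query "{term}" .
}}
LIMIT 10"""


def build_text_query(term):
    """Return a SPARQL query that runs a text search for `term`."""
    # Escape backslashes and double quotes so the term stays a valid literal.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return TEXT_QUERY_TEMPLATE.format(term=escaped)


def build_query_url(endpoint, term):
    """Return a GET URL for the SPARQL query endpoint (nothing is sent)."""
    return endpoint + "?" + urlencode({"query": build_text_query(term)})


# Assumed endpoint: adjust host, port and service name to your setup.
url = build_query_url("http://localhost:3030/biology/query", "ribosome")

# Round-trip check: the query survives URL encoding unchanged.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded)
```

If ?lit comes back unbound for every row, that is consistent with the index having been built without storeValues.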



Also you need to reload the data if you change the configuration so that the 
indexing will be done according to the configuration.

ciao,
Chris

On Mar 26, 2020, at 10:33 AM, Zhenya Antić wrote:

@prefix : <http://base/#> .
@prefix tdb2: <http://jena.apache.org/2016/tdb#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix text: <http://jena.apache.org/text#> .

<http://jena.apache.org/2016/tdb#DatasetTDB>
rdfs:subClassOf ja:RDFDataset .

ja:DatasetTxnMem rdfs:subClassOf ja:RDFDataset .

tdb2:DatasetTDB2 rdfs:subClassOf ja:RDFDataset .

tdb2:GraphTDB2 rdfs:subClassOf ja:Model .

<http://jena.apache.org/2016/tdb#GraphTDB2>
rdfs:subClassOf ja:Model .

ja:MemoryDataset rdfs:subClassOf ja:RDFDataset .

ja:RDFDatasetZero rdfs:subClassOf ja:RDFDataset .

<http://jena.apache.org/text#TextDataset>
rdfs:subClassOf ja:RDFDataset .

:service_tdb_all a fuseki:Service ;
rdfs:label "TDB biology" ;
fuseki:dataset :tdb_dataset_readwrite ;
fuseki:name "biology" ;
fuseki:serviceQuery "query" , "" , "sparql" ;
fuseki:serviceReadGraphStore "get" ;
fuseki:serviceReadQuads "" ;
fuseki:serviceReadWriteGraphStore
"data" ;
fuseki:serviceReadWriteQuads "" ;
fuseki:serviceUpdate "" , "update" ;
fuseki:serviceUpload "upload" .

:tdb_dataset_readwrite
a tdb2:DatasetTDB2 ;
tdb2:location "db" .

<http://jena.apache.org/2016/tdb#GraphTDB>
rdfs:subClassOf ja:Model .

ja:RDFDatasetOne rdfs:subClassOf ja:RDFDataset .

ja:RDFDatasetSink rdfs:subClassOf ja:RDFDataset .

<http://jena.apache.org/2016/tdb#DatasetTDB2>
rdfs:subClassOf ja:RDFDataset .

<#dataset> rdf:type tdb2:DatasetTDB2 ;
tdb2:location "db" ; #path to TDB;
.

# Text index description
:text_dataset rdf:type text:TextDataset ;
text:dataset <#dataset> ; # <-- replace `:my_dataset` with the desired URI
text:index <#indexLucene> ;
.

<#indexLucene> a text:TextIndexLucene ;
text:directory <file:data/luceneIndexing> ;
text:entityMap <#entMap> ;
.

<#entMap> a text:EntityMap ;
text:defaultField "text" ;
text:entityField "uri" ;
text:map (
#RDF label abstracts
[ text:field "text" ;
text:predicate <http://www.w3.org/1999/02/22-rdf-syntax-ns#label> ;
text:analyzer [
a text:StandardAnalyzer
]
]
[ text:field "text" ;
text:predicate <http://www.w3.org/1999/02/22-rdf-syntax-ns#synonym> ;
text:analyzer [
a text:StandardAnalyzer
]
]
) .



<#service_text_tdb> rdf:type fuseki:Service ;
fuseki:name "ds" ;
fuseki:serviceQuery "query" ;
fuseki:serviceQuery "sparql" ;
fuseki:serviceUpdate "update" ;
fuseki:serviceUpload "upload" ;
fuseki:serviceReadGraphStore "get" ;
fuseki:serviceReadWriteGraphStore "data" ;
fuseki:dataset :text_dataset ;
.



On Thu, Mar 26, 2020, at 11:31 AM, Zhenya Antić wrote:

Hi Andy,

Thanks. So I think I have all the lines you listed in the .ttl file (attached). 
I also checked, the data file contains the relevant data. But I have 0 
properties indexed.

Thanks,
Zhenya



On Wed, Mar 25, 2020, at 4:41 AM, Andy Seaborne wrote:



On 24/03/2020 15:11, Zhenya Antić wrote:

Hi Andy,

Did you load the data before attaching the text index?


How do I do it (or not do it, wasn't sure from your post)?


Set up the Fuseki system, with the text index as the Fuseki service dataset:

fuseki:name "biology" ;
fuseki:dataset :text_dataset ;
...

:text_dataset rdf:type text:TextDataset ;
text:dataset <#dataset> ;



<#dataset> rdf:type tdb2:DatasetTDB2 ;
tdb2:location "db" ; #path to TDB;
.

then send the data to /biology/data (which is the SPARQL GSP write
endpoint) or however you want to push the data to the server (SPARQL
Update, or the UI).
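
As an illustration of the first route, here is a minimal Python sketch of a Graph Store Protocol POST to /biology/data. The host localhost is an assumption (port 3030 and the name "biology" appear in this thread), the one-triple payload is made up, and the function names are hypothetical. The request is only built here; send() would need a running server.

```python
import urllib.request

# Assumed base URL: port 3030 and the dataset name "biology" come from the
# thread; the host "localhost" is an assumption.
FUSEKI_BASE = "http://localhost:3030"


def build_gsp_post(dataset, turtle_bytes):
    """Build (but do not send) a Graph Store Protocol POST that adds
    Turtle data to the default graph of the given dataset."""
    url = f"{FUSEKI_BASE}/{dataset}/data?default"
    return urllib.request.Request(
        url,
        data=turtle_bytes,
        method="POST",
        headers={"Content-Type": "text/turtle; charset=utf-8"},
    )


def send(request):
    """Send the request to a running server and return the HTTP status."""
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical one-triple payload, for illustration only.
SAMPLE = (
    b"<http://example.org/s> "
    b"<http://www.w3.org/2000/01/rdf-schema#label> "
    b'"Biology" .\n'
)
request = build_gsp_post("biology", SAMPLE)
print(request.get_method(), request.full_url)
```

Because the data goes through the service whose dataset is the text dataset, it is indexed as it arrives.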

For very large data:

Load the TDB2 dataset offline
Then run the "jena.textindexer" utility

https://jena.apache.org/documentation/query/text-query.html#configuration

The first way is easier.
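
For the offline route, the two steps might look like the commands assembled below. This is a sketch under assumptions: the file names data.ttl, config.ttl and fuseki-server.jar are placeholders, "db" is the tdb2:location from the configuration in this thread, and the exact invocation can differ between Jena versions and installs. The script only prints the commands; it does not run them.

```python
import shlex


def offline_load_commands(tdb_location, data_file, fuseki_jar, config_file):
    """Return the argv lists for (1) bulk-loading a TDB2 database and
    (2) building the text index from it with jena.textindexer."""
    load = ["tdb2.tdbloader", "--loc", tdb_location, data_file]
    index = [
        "java", "-cp", fuseki_jar,
        "jena.textindexer", f"--desc={config_file}",
    ]
    return [load, index]


# Placeholder paths, for illustration only.
commands = offline_load_commands(
    "db", "data.ttl", "fuseki-server.jar", "config.ttl"
)
for argv in commands:
    print(shlex.join(argv))
```

Both steps are meant to be run while the Fuseki server is stopped.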

Andy


Thanks,
Zhenya



On Sun, Mar 22, 2020, at 9:18 AM, Andy Seaborne wrote:

Just checking one point:

Did you load the data before attaching the text index?

The text index is calculated as data is added so if you first load the
dataset then set up a text index, it will miss indexing the data.
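
A toy model of that behaviour, in Python. It is not how Jena is implemented; the class ToyTextDataset and the prefixed names used as plain strings are made up for illustration. The point is only that indexing happens on the add path, so triples already in the store when the index is attached are never seen by it.

```python
class ToyTextDataset:
    """Toy stand-in for a text-indexed dataset: triples added through this
    wrapper are stored and, if the predicate is mapped, also indexed."""

    def __init__(self, store, indexed_predicates):
        self.store = store  # existing triples are NOT re-indexed
        self.indexed_predicates = set(indexed_predicates)
        self.index = {}  # lowercase token -> set of subjects

    def add(self, s, p, o):
        """Store the triple and index its object if the predicate is mapped."""
        self.store.append((s, p, o))
        if p in self.indexed_predicates:
            for token in o.lower().split():
                self.index.setdefault(token, set()).add(s)

    def search(self, word):
        """Return the subjects whose indexed text contains the word."""
        return self.index.get(word.lower(), set())


# One triple was loaded before the index existed.
store = [("ex:Biology", "rdfs:label", "Biology")]
dataset = ToyTextDataset(store, ["rdfs:label"])

missed = dataset.search("biology")  # never passed through add()
dataset.add("ex:Life", "rdfs:label", "Life")
found = dataset.search("life")

print(missed, found)
```

In the toy model, as in the real setup, the fix for the missed triple is to load it again through the indexed dataset.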

Andy

On 21/03/2020 07:55, Lorenz Buehmann wrote:

Hi,

welcome to Semantic Web and Apache Jena.

Comments inline:

On 20.03.20 15:36, Zhenya Antić wrote:

Hello,

I am a beginner with Fuseki, knowledge graphs and SPARQL, so please forgive me 
if the questions seem obvious, the learning curve for this turned out to be 
quite steep.

No problem, nothing is simple in the beginning.


I am trying to get text indexing to work with my Fuseki knowledge graph.

Which DBpedia dataset did you load? I mean, which files?


For starters, I tried using a regular expression, but that didn't work:

Just a plain query like this:
SELECT DISTINCT * WHERE {
?s ?p ?o
}
gives 98 results such as:

1
<http://dbpedia.org/ontology/wikiPageID:9127632>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#label>
<http://dbpedia.org/resource/Biology>
2
<http://dbpedia.org/ontology/wikiPageID:9127632>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#label>
<http://dbpedia.org/resource/Biology#Branches>
3
<http://dbpedia.org/ontology/wikiPageID:9127632>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#synonym>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#branches_of_biology>
4
<http://dbpedia.org/ontology/wikiPageID:18393>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#label>
<http://dbpedia.org/resource/Life>

That can't be the correct output of this query. rdfs:label should return
literals as object (?o) - or you loaded some really weird data


But a query with a regular expression:
SELECT DISTINCT * WHERE {
?s ?p ?o
FILTER regex(?o, "Biol", "i")
}


1. you should help the query engine and use rdfs:label as property

2. you should use str() function on the ?o values:

SELECT DISTINCT * WHERE {
?s rdfs:label ?o
FILTER regex(str(?o), "Biol", "i")
}
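
Why str() matters can be shown with a toy model in Python. This is not ARQ's code; the Term type and function names are hypothetical. It mimics two SPARQL rules: REGEX applied to an IRI is a type error, and FILTER drops a row when its expression raises an error. In the sample rows above the objects are IRIs, so the unwrapped regex never matches.

```python
import re
from typing import NamedTuple, Optional


class Term(NamedTuple):
    kind: str  # "iri" or "literal"
    value: str


def sparql_regex(term, pattern, flags="") -> Optional[bool]:
    """Toy REGEX: returns None (a type error) for anything but a literal."""
    if term.kind != "literal":
        return None
    re_flags = re.IGNORECASE if "i" in flags else 0
    return re.search(pattern, term.value, re_flags) is not None


def sparql_str(term):
    """Toy STR(): the lexical form of a literal, or the IRI text,
    returned as a plain literal."""
    return Term("literal", term.value)


def filter_keeps(result):
    """FILTER keeps a row only when the expression is true; errors drop it."""
    return result is True


obj = Term("iri", "http://dbpedia.org/resource/Biology")
without_str = filter_keeps(sparql_regex(obj, "Biol", "i"))
with_str = filter_keeps(sparql_regex(sparql_str(obj), "Biol", "i"))

print(without_str, with_str)
```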

gives 0 results, although there are clearly results that contain "Biol".



I'll have to try your config; maybe others will spot the issue in the meantime.


I also tried setting up indexing with a .ttl file, however the result was "INFO 0 (0 
per second) properties indexed". .ttl file below:

@prefix : <http://base/#> .
@prefix tdb2: <http://jena.apache.org/2016/tdb#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix text: <http://jena.apache.org/text#> .

<http://jena.apache.org/2016/tdb#DatasetTDB>
rdfs:subClassOf ja:RDFDataset .

ja:DatasetTxnMem rdfs:subClassOf ja:RDFDataset .

tdb2:DatasetTDB2 rdfs:subClassOf ja:RDFDataset .

tdb2:GraphTDB2 rdfs:subClassOf ja:Model .

<http://jena.apache.org/2016/tdb#GraphTDB2>
rdfs:subClassOf ja:Model .

ja:MemoryDataset rdfs:subClassOf ja:RDFDataset .

ja:RDFDatasetZero rdfs:subClassOf ja:RDFDataset .


The rdfs:subClassOf declarations should not be necessary in recent versions of Fuseki.

If any are, let us know so it can be fixed.


<http://jena.apache.org/text#TextDataset>
rdfs:subClassOf ja:RDFDataset .

:service_tdb_all a fuseki:Service ;
rdfs:label "TDB biology" ;
fuseki:dataset :tdb_dataset_readwrite ;
fuseki:name "biology" ;
fuseki:serviceQuery "query" , "" , "sparql" ;
fuseki:serviceReadGraphStore "get" ;
fuseki:serviceReadQuads "" ;
fuseki:serviceReadWriteGraphStore
"data" ;
fuseki:serviceReadWriteQuads "" ;
fuseki:serviceUpdate "" , "update" ;
fuseki:serviceUpload "upload" .

:tdb_dataset_readwrite
a tdb2:DatasetTDB2 ;
tdb2:location "db" .

<http://jena.apache.org/2016/tdb#GraphTDB>
rdfs:subClassOf ja:Model .

ja:RDFDatasetOne rdfs:subClassOf ja:RDFDataset .

ja:RDFDatasetSink rdfs:subClassOf ja:RDFDataset .

<http://jena.apache.org/2016/tdb#DatasetTDB2>
rdfs:subClassOf ja:RDFDataset .

<#dataset> rdf:type tdb2:DatasetTDB2 ;
tdb2:location "db" ; #path to TDB;
.

# Text index description
:text_dataset rdf:type text:TextDataset ;
text:dataset <#dataset> ; # <-- replace `:my_dataset` with the desired URI
text:index <#indexLucene> ;
.

<#indexLucene> a text:TextIndexLucene ;
text:directory <file:data/luceneIndexing> ;
text:entityMap <#entMap> ;
.

<#entMap> a text:EntityMap ;
text:defaultField "text" ;
text:entityField "uri" ;
text:map (
#RDF label abstracts
[ text:field "text" ;
text:predicate <http://www.w3.org/1999/02/22-rdf-syntax-ns#label> ;
text:analyzer [
a text:StandardAnalyzer
]
]
[ text:field "text" ;
text:predicate <http://www.w3.org/1999/02/22-rdf-syntax-ns#synonym> ;
text:analyzer [
a text:StandardAnalyzer
]
]
) .



<#service_text_tdb> rdf:type fuseki:Service ;
fuseki:name "ds" ;
fuseki:serviceQuery "query" ;
fuseki:serviceQuery "sparql" ;
fuseki:serviceUpdate "update" ;
fuseki:serviceUpload "upload" ;
fuseki:serviceReadGraphStore "get" ;
fuseki:serviceReadWriteGraphStore "data" ;
fuseki:dataset :text_dataset ;
.

Thank you so much in advance,

__________________________
Zhenya Antić, PhD
Natural Language Processing
https://www.linkedin.com/in/zhenya-antic/

Practical Linguistics Inc
http://www.practicallinguistics.com
