Re: Apache Jena Fuseki with text indexing

Zhenya Antić Thu, 26 Mar 2020 11:27:21 -0700

Andy,

I think I figured out what the issue is. It seems that I have two datasets with 
the same name, and one was started with the config file I sent (and has no data 
in it - and hence it is not indexed), and the other was started without a 
config file (like this: fuseki-server --port 3030 --loc="db" /biology), and it 
has the data.
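
A minimal sketch of the end state I think is needed, with only one service 
named "biology" and the text dataset behind it. The resource names 
:service_biology and :tdb_dataset are placeholders made up for this sketch, 
and it assumes the plain TDB2 service from the config quoted below is dropped:

```turtle
@prefix :       <http://base/#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix tdb2:   <http://jena.apache.org/2016/tdb#> .
@prefix text:   <http://jena.apache.org/text#> .

# One service only: "biology" is backed by the text dataset,
# so data written through this service passes through the text index.
:service_biology rdf:type fuseki:Service ;
    fuseki:name "biology" ;
    fuseki:serviceQuery "query" , "sparql" ;
    fuseki:serviceUpdate "update" ;
    fuseki:serviceReadWriteGraphStore "data" ;
    fuseki:dataset :text_dataset .

:text_dataset rdf:type text:TextDataset ;
    text:dataset :tdb_dataset ;
    text:index   <#indexLucene> .   # index definition as in the quoted config

# Assumption: "db" holds a TDB2 database. A directory created by a plain
# --loc start may be a TDB1 store, in which case this type would not match.
:tdb_dataset rdf:type tdb2:DatasetTDB2 ;
    tdb2:location "db" .
```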


How do I transfer the data from one to the other?

Thanks,
Zhenya


On Thu, Mar 26, 2020, at 12:22 PM, Chris Tomlinson wrote:
> Zhenya,
> 
> Do you see any content in the directory:
> 
> > text:directory <file:data/luceneIndexing> ;
> 
> like the following partial listing:
> 
> > fuseki@foo :~/base/lucene-test$ ls -l
> > total 3608108
> > -rw-rw---- 1 fuseki fuseki 7772 Jan 29 21:15 _19a_5x.liv
> > -rw-r----- 1 fuseki fuseki 299 Jan 21 15:53 _19a.cfe
> > -rw-r----- 1 fuseki fuseki 36547721 Jan 21 15:53 _19a.cfs
> > -rw-r----- 1 fuseki fuseki 443 Jan 21 15:53 _19a.si
> > -rw-r----- 1 fuseki fuseki 23621 Jan 21 15:53 _24_17n.liv
> > -rw-r----- 1 fuseki fuseki 22718569 Jan 21 15:53 _24.fdt
> > -rw-r----- 1 fuseki fuseki 9184 Jan 21 15:53 _24.fdx
> > -rw-r----- 1 fuseki fuseki 12975 Jan 21 15:53 _24.fnm
> > -rw-r----- 1 fuseki fuseki 7009762 Jan 21 15:53 _24_Lucene50_0.doc
> > -rw-r----- 1 fuseki fuseki 3804794 Jan 21 15:53 _24_Lucene50_0.pos
> > -rw-r----- 1 fuseki fuseki 16186474 Jan 21 15:53 _24_Lucene50_0.tim
> > -rw-r----- 1 fuseki fuseki 103945 Jan 21 15:53 _24_Lucene50_0.tip
> > -rw-r----- 1 fuseki fuseki 667296 Jan 21 15:53 _24.nvd
> > -rw-r----- 1 fuseki fuseki 4027 Jan 21 15:53 _24.nvm
> > -rw-r----- 1 fuseki fuseki 540 Jan 21 15:53 _24.si
> 
> Also, if you don’t have text:storeValues set to true, then queries like:
> 
>  (?s ?score ?lit) text:query "ribosome"
> 
> won’t bind anything to ?lit. text:storeValues is set like this:
> 
> > # Text index description
> > :test_lucene_index a text:TextIndexLucene ;
> > text:directory <file:/usr/local/fuseki/base/lucene-test> ;
> > text:storeValues true ;
> > text:entityMap :test_entmap ;
> 
> 
> Also, you need to reload the data if you change the configuration, so that 
> it is indexed according to the new configuration.
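> 
> Applying text:storeValues to the <#indexLucene> definition from the config 
> quoted below might look roughly like this (a sketch: only the 
> text:storeValues line is new, the rest is copied from that config and 
> assumed unchanged):
> 
> ```turtle
> @prefix text: <http://jena.apache.org/text#> .
> 
> <#indexLucene> a text:TextIndexLucene ;
>     text:directory <file:data/luceneIndexing> ;
>     text:storeValues true ;   # store the literal so ?lit can be bound
>     text:entityMap <#entMap> ;
>     .
> ```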
> 
> ciao,
> Chris
> 
> 
> > On Mar 26, 2020, at 10:33 AM, Zhenya Antić wrote:
> > 
> > @prefix : <http://base/#> .
> > @prefix tdb2: <http://jena.apache.org/2016/tdb#> .
> > @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
> > @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
> > @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
> > @prefix fuseki: <http://jena.apache.org/fuseki#> .
> > @prefix text: <http://jena.apache.org/text#> .
> > 
> > <http://jena.apache.org/2016/tdb#DatasetTDB>
> > rdfs:subClassOf ja:RDFDataset .
> > 
> > ja:DatasetTxnMem rdfs:subClassOf ja:RDFDataset .
> > 
> > tdb2:DatasetTDB2 rdfs:subClassOf ja:RDFDataset .
> > 
> > tdb2:GraphTDB2 rdfs:subClassOf ja:Model .
> > 
> > <http://jena.apache.org/2016/tdb#GraphTDB2>
> > rdfs:subClassOf ja:Model .
> > 
> > ja:MemoryDataset rdfs:subClassOf ja:RDFDataset .
> > 
> > ja:RDFDatasetZero rdfs:subClassOf ja:RDFDataset .
> > 
> > <http://jena.apache.org/text#TextDataset>
> > rdfs:subClassOf ja:RDFDataset .
> > 
> > :service_tdb_all a fuseki:Service ;
> > rdfs:label "TDB biology" ;
> > fuseki:dataset :tdb_dataset_readwrite ;
> > fuseki:name "biology" ;
> > fuseki:serviceQuery "query" , "" , "sparql" ;
> > fuseki:serviceReadGraphStore "get" ;
> > fuseki:serviceReadQuads "" ;
> > fuseki:serviceReadWriteGraphStore
> > "data" ;
> > fuseki:serviceReadWriteQuads "" ;
> > fuseki:serviceUpdate "" , "update" ;
> > fuseki:serviceUpload "upload" .
> > 
> > :tdb_dataset_readwrite
> > a tdb2:DatasetTDB2 ;
> > tdb2:location "db" .
> > 
> > <http://jena.apache.org/2016/tdb#GraphTDB>
> > rdfs:subClassOf ja:Model .
> > 
> > ja:RDFDatasetOne rdfs:subClassOf ja:RDFDataset .
> > 
> > ja:RDFDatasetSink rdfs:subClassOf ja:RDFDataset .
> > 
> > <http://jena.apache.org/2016/tdb#DatasetTDB2>
> > rdfs:subClassOf ja:RDFDataset .
> > 
> > <#dataset> rdf:type tdb2:DatasetTDB2 ;
> > tdb2:location "db" ; #path to TDB;
> > .
> > 
> > # Text index description
> > :text_dataset rdf:type text:TextDataset ;
> > text:dataset <#dataset> ; # <-- replace `:my_dataset` with the desired URI
> > text:index <#indexLucene> ;
> > .
> > 
> > <#indexLucene> a text:TextIndexLucene ;
> > text:directory <file:data/luceneIndexing> ;
> > text:entityMap <#entMap> ;
> > .
> > 
> > <#entMap> a text:EntityMap ;
> > text:defaultField "text" ;
> > text:entityField "uri" ;
> > text:map (
> > #RDF label abstracts
> > [ text:field "text" ;
> > text:predicate <http://www.w3.org/1999/02/22-rdf-syntax-ns#label> ;
> > text:analyzer [
> > a text:StandardAnalyzer
> > ] 
> > ]
> > [ text:field "text" ;
> > text:predicate <http://www.w3.org/1999/02/22-rdf-syntax-ns#synonym> ;
> > text:analyzer [
> > a text:StandardAnalyzer
> > ] 
> > ]
> > ) .
> > 
> > 
> > 
> > <#service_text_tdb> rdf:type fuseki:Service ;
> > fuseki:name "ds" ;
> > fuseki:serviceQuery "query" ;
> > fuseki:serviceQuery "sparql" ;
> > fuseki:serviceUpdate "update" ;
> > fuseki:serviceUpload "upload" ;
> > fuseki:serviceReadGraphStore "get" ;
> > fuseki:serviceReadWriteGraphStore "data" ;
> > fuseki:dataset :text_dataset ;
> > .
> > 
> > 
> > 
> > On Thu, Mar 26, 2020, at 11:31 AM, Zhenya Antić wrote:
> >> Hi Andy,
> >> 
> >> Thanks. So I think I have all the lines you listed in the .ttl file 
> >> (attached). I also checked, the data file contains the relevant data. But 
> >> I have 0 properties indexed.
> >> 
> >> Thanks,
> >> Zhenya
> >> 
> >> 
> >> 
> >> On Wed, Mar 25, 2020, at 4:41 AM, Andy Seaborne wrote:
> >>> 
> >>> 
> >>> On 24/03/2020 15:11, Zhenya Antić wrote:
> >>>> Hi Andy,
> >>>> 
> >>>>> Did you load the data before attaching the text index?
> >>>> 
> >>>> How do I do it (or not do it, wasn't sure from your post)?
> >>> 
> >>> Set up the Fuseki system, with the text index as the Fuseki service 
> >>> dataset:
> >>> 
> >>> fuseki:name "biology" ;
> >>> fuseki:dataset :text_dataset ;
> >>> ...
> >>> 
> >>> :text_dataset rdf:type text:TextDataset ;
> >>> text:dataset <#dataset> ;
> >>> 
> >>> 
> >>> 
> >>> <#dataset> rdf:type tdb2:DatasetTDB2 ;
> >>> tdb2:location "db" ; #path to TDB;
> >>> .
> >>> 
> >>> then send the data to /biology/data (which is the SPARQL GSP write 
> >>> endpoint) or however you want to push the data to the server (SPARQL 
> >>> Update, or the UI).
> >>> 
> >>> For very large data:
> >>> 
> >>> Load the TDB2 dataset offline
> >>> Then run the "jena.textindexer" utility
> >>> 
> >>> https://jena.apache.org/documentation/query/text-query.html#configuration
> >>> 
> >>> The first way is easier.
> >>> 
> >>> Andy
> >>> 
> >>>> 
> >>>> Thanks,
> >>>> Zhenya
> >>>> 
> >>>> 
> >>>> 
> >>>> On Sun, Mar 22, 2020, at 9:18 AM, Andy Seaborne wrote:
> >>>>> Just checking one point:
> >>>>> 
> >>>>> Did you load the data before attaching the text index?
> >>>>> 
> >>>>> The text index is built as data is added, so if you first load the
> >>>>> dataset and then set up a text index, the existing data will not be indexed.
> >>>>> 
> >>>>> Andy
> >>>>> 
> >>>>> On 21/03/2020 07:55, Lorenz Buehmann wrote:
> >>>>>> Hi,
> >>>>>> 
> >>>>>> welcome to Semantic Web and Apache Jena.
> >>>>>> 
> >>>>>> Comments inline:
> >>>>>> 
> >>>>>> On 20.03.20 15:36, Zhenya Antić wrote:
> >>>>>>> Hello,
> >>>>>>> 
> >>>>>>> I am a beginner with Fuseki, knowledge graphs and SPARQL, so please 
> >>>>>>> forgive me if the questions seem obvious, the learning curve for this 
> >>>>>>> turned out to be quite steep.
> >>>>>> No problem, nothing is simple in the beginning.
> >>>>>>> 
> >>>>>>> I am trying to get text indexing to work with my Fuseki knowledge 
> >>>>>>> graph.
> >>>>>> Which DBpedia dataset did you load? I mean, which files?
> >>>>>>> 
> >>>>>>> For starters, I tried using a regular expression, but that didn't 
> >>>>>>> work:
> >>>>>>> 
> >>>>>>> Just a plain query like this:
> >>>>>>> SELECT DISTINCT * WHERE {
> >>>>>>> ?s ?p ?o
> >>>>>>> }
> >>>>>>> gives 98 results such as:
> >>>>>>> 
> >>>>>>> 1
> >>>>>>> <http://dbpedia.org/ontology/wikiPageID:9127632>
> >>>>>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#label>
> >>>>>>> <http://dbpedia.org/resource/Biology>
> >>>>>>> 2
> >>>>>>> <http://dbpedia.org/ontology/wikiPageID:9127632>
> >>>>>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#label>
> >>>>>>> <http://dbpedia.org/resource/Biology#Branches>
> >>>>>>> 3
> >>>>>>> <http://dbpedia.org/ontology/wikiPageID:9127632>
> >>>>>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#synonym>
> >>>>>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#branches_of_biology>
> >>>>>>> 4
> >>>>>>> <http://dbpedia.org/ontology/wikiPageID:18393>
> >>>>>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#label>
> >>>>>>> <http://dbpedia.org/resource/Life>
> >>>>>> That can't be the correct output of this query. rdfs:label should 
> >>>>>> return literals as the object (?o), or you loaded some really weird data.
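> >>>>>> 
> >>>>>> For comparison, a label triple would normally carry a string literal 
> >>>>>> as its object and use the rdfs: namespace 
> >>>>>> (http://www.w3.org/2000/01/rdf-schema#), not the rdf: one shown in the 
> >>>>>> output above. A small illustrative fragment (the label strings are 
> >>>>>> written for this example, not taken from your data):
> >>>>>> 
> >>>>>> ```turtle
> >>>>>> @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
> >>>>>> 
> >>>>>> # The object is a string literal, so regex() has text to match.
> >>>>>> <http://dbpedia.org/resource/Biology> rdfs:label "Biology"@en .
> >>>>>> <http://dbpedia.org/resource/Life>    rdfs:label "Life"@en .
> >>>>>> ```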
> >>>>>>> 
> >>>>>>> But a query with a regular expression:
> >>>>>>> SELECT DISTINCT * WHERE {
> >>>>>>> ?s ?p ?o
> >>>>>>> FILTER regex(?o, "Biol", "i")
> >>>>>>> }
> >>>>>> 
> >>>>>> 1. you should help the query engine and use rdfs:label as property
> >>>>>> 
> >>>>>> 2. you should use str() function on the ?o values:
> >>>>>> 
> >>>>>> SELECT DISTINCT * WHERE {
> >>>>>> ?s rdfs:label ?o
> >>>>>> FILTER regex(str(?o), "Biol", "i")
> >>>>>> }
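> >>>>>> 
> >>>>>> If the labels end up as rdfs:label literals, the entity map in the 
> >>>>>> config would presumably have to name the same predicate. A sketch 
> >>>>>> under that assumption (field names are taken from the config quoted 
> >>>>>> below; the single-entry map is a simplification):
> >>>>>> 
> >>>>>> ```turtle
> >>>>>> @prefix text: <http://jena.apache.org/text#> .
> >>>>>> @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
> >>>>>> 
> >>>>>> <#entMap> a text:EntityMap ;
> >>>>>>     text:defaultField "text" ;
> >>>>>>     text:entityField  "uri" ;
> >>>>>>     text:map (
> >>>>>>         # index the literal values of rdfs:label into the "text" field
> >>>>>>         [ text:field "text" ; text:predicate rdfs:label ]
> >>>>>>     ) .
> >>>>>> ```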
> >>>>>> 
> >>>>>>> gives 0 results, although there are clearly results that contain 
> >>>>>>> "Biol".
> >>>>>> 
> >>>>>> 
> >>>>>> I'll have to try your config, or maybe others will spot the issue in the 
> >>>>>> meantime.
> >>>>>> 
> >>>>>>> 
> >>>>>>> I also tried setting up indexing with a .ttl file, however the result 
> >>>>>>> was "INFO 0 (0 per second) properties indexed". .ttl file below:
> >>>>>>> 
> >>>>>>> @prefix : <http://base/#> .
> >>>>>>> @prefix tdb2: <http://jena.apache.org/2016/tdb#> .
> >>>>>>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
> >>>>>>> @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
> >>>>>>> @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
> >>>>>>> @prefix fuseki: <http://jena.apache.org/fuseki#> .
> >>>>>>> @prefix text: <http://jena.apache.org/text#> .
> >>>>>>> 
> >>>>>>> <http://jena.apache.org/2016/tdb#DatasetTDB>
> >>>>>>> rdfs:subClassOf ja:RDFDataset .
> >>>>>>> 
> >>>>>>> ja:DatasetTxnMem rdfs:subClassOf ja:RDFDataset .
> >>>>>>> 
> >>>>>>> tdb2:DatasetTDB2 rdfs:subClassOf ja:RDFDataset .
> >>>>>>> 
> >>>>>>> tdb2:GraphTDB2 rdfs:subClassOf ja:Model .
> >>>>>>> 
> >>>>>>> <http://jena.apache.org/2016/tdb#GraphTDB2>
> >>>>>>> rdfs:subClassOf ja:Model .
> >>>>>>> 
> >>>>>>> ja:MemoryDataset rdfs:subClassOf ja:RDFDataset .
> >>>>>>> 
> >>>>>>> ja:RDFDatasetZero rdfs:subClassOf ja:RDFDataset .
> >>>>> 
> >>>>> The rdfs:subClassOf should not be necessary (recent versions of Fuseki).
> >>>>> 
> >>>>> If any are, let us know so it can be fixed.
> >>>>> 
> >>>>>>> 
> >>>>>>> <http://jena.apache.org/text#TextDataset>
> >>>>>>> rdfs:subClassOf ja:RDFDataset .
> >>>>>>> 
> >>>>>>> :service_tdb_all a fuseki:Service ;
> >>>>>>> rdfs:label "TDB biology" ;
> >>>>>>> fuseki:dataset :tdb_dataset_readwrite ;
> >>>>>>> fuseki:name "biology" ;
> >>>>>>> fuseki:serviceQuery "query" , "" , "sparql" ;
> >>>>>>> fuseki:serviceReadGraphStore "get" ;
> >>>>>>> fuseki:serviceReadQuads "" ;
> >>>>>>> fuseki:serviceReadWriteGraphStore
> >>>>>>> "data" ;
> >>>>>>> fuseki:serviceReadWriteQuads "" ;
> >>>>>>> fuseki:serviceUpdate "" , "update" ;
> >>>>>>> fuseki:serviceUpload "upload" .
> >>>>>>> 
> >>>>>>> :tdb_dataset_readwrite
> >>>>>>> a tdb2:DatasetTDB2 ;
> >>>>>>> tdb2:location "db" .
> >>>>>>> 
> >>>>>>> <http://jena.apache.org/2016/tdb#GraphTDB>
> >>>>>>> rdfs:subClassOf ja:Model .
> >>>>>>> 
> >>>>>>> ja:RDFDatasetOne rdfs:subClassOf ja:RDFDataset .
> >>>>>>> 
> >>>>>>> ja:RDFDatasetSink rdfs:subClassOf ja:RDFDataset .
> >>>>>>> 
> >>>>>>> <http://jena.apache.org/2016/tdb#DatasetTDB2>
> >>>>>>> rdfs:subClassOf ja:RDFDataset .
> >>>>>>> 
> >>>>>>> <#dataset> rdf:type tdb2:DatasetTDB2 ;
> >>>>>>> tdb2:location "db" ; #path to TDB;
> >>>>>>> .
> >>>>>>> 
> >>>>>>> # Text index description
> >>>>>>> :text_dataset rdf:type text:TextDataset ;
> >>>>>>> text:dataset <#dataset> ; # <-- replace `:my_dataset` with the desired URI
> >>>>>>> text:index <#indexLucene> ;
> >>>>>>> .
> >>>>>>> 
> >>>>>>> <#indexLucene> a text:TextIndexLucene ;
> >>>>>>> text:directory <file:data/luceneIndexing> ;
> >>>>>>> text:entityMap <#entMap> ;
> >>>>>>> .
> >>>>>>> 
> >>>>>>> <#entMap> a text:EntityMap ;
> >>>>>>> text:defaultField "text" ;
> >>>>>>> text:entityField "uri" ;
> >>>>>>> text:map (
> >>>>>>> #RDF label abstracts
> >>>>>>> [ text:field "text" ;
> >>>>>>> text:predicate <http://www.w3.org/1999/02/22-rdf-syntax-ns#label> ;
> >>>>>>> text:analyzer [
> >>>>>>> a text:StandardAnalyzer
> >>>>>>> ]
> >>>>>>> ]
> >>>>>>> [ text:field "text" ;
> >>>>>>> text:predicate <http://www.w3.org/1999/02/22-rdf-syntax-ns#synonym> ;
> >>>>>>> text:analyzer [
> >>>>>>> a text:StandardAnalyzer
> >>>>>>> ]
> >>>>>>> ]
> >>>>>>> ) .
> >>>>>>> 
> >>>>>>> 
> >>>>>>> 
> >>>>>>> <#service_text_tdb> rdf:type fuseki:Service ;
> >>>>>>> fuseki:name "ds" ;
> >>>>>>> fuseki:serviceQuery "query" ;
> >>>>>>> fuseki:serviceQuery "sparql" ;
> >>>>>>> fuseki:serviceUpdate "update" ;
> >>>>>>> fuseki:serviceUpload "upload" ;
> >>>>>>> fuseki:serviceReadGraphStore "get" ;
> >>>>>>> fuseki:serviceReadWriteGraphStore "data" ;
> >>>>>>> fuseki:dataset :text_dataset ;
> >>>>>>> .
> >>>>>>> 
> >>>>>>> Thank you so much in advance,
> >>>>>>> 
> >>>>>>> __________________________
> >>>>>>> Zhenya Antić, PhD
> >>>>>>> Natural Language Processing
> >>>>>>> https://www.linkedin.com/in/zhenya-antic/
> >>>>>>> 
> >>>>>>> Practical Linguistics Inc
> >>>>>>> http://www.practicallinguistics.com
> >>>>>>> 
> >>>>>>> 
> >>>>>>> 
> >>>>>> 
> >>>>> 
> >>>> 
> >>> 
> >> 
> 
>
