Re: Re: SPARQL limit doesn't work

2022-10-19 Thread Lorenz Buehmann



On 19.10.22 13:44, Mikael Pesonen wrote:




On 19/10/2022 10.18, Lorenz Buehmann wrote:
Honestly - probably because of lack of knowledge - I don't see how 
that can happen with the text index. You have a single triple pattern 
that is querying the Lucene index for the given pattern and returns 
by default at most 10 000 documents.



text:query (skos:prefLabel skos:altLabel "\"xx yy\"" "lang:en" )

translates to


( (prefLabel:"\"xx yy\"" OR altLabel:"\"xx yy\"") AND lang:en)
which indeed can return duplicate documents as for each triple a 
separate document is created and indexed.
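If those duplicates are what eats into the page size, a workaround sketch would be to collapse them before paging, e.g. keeping the best score per subject (untested against your data):

    { SELECT ?s (MAX(?score) AS ?maxScore) WHERE {
        (?s ?score) text:query (skos:prefLabel skos:altLabel "\"xx yy\"" "lang:en" ) .
      } GROUP BY ?s ORDER BY DESC(?maxScore) OFFSET 0 LIMIT 1000 }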


I still don't get how a query that returns 560 with limit 1000 then
doesn't return 100 with limit 100


Currently, I find your results quite counterintuitive, but I still
have to learn a lot when using RDF, SPARQL and Jena.



Can you share some data please to reproduce?
Unfortunately I can't share the data. Of course, when I have time, I could
create a similar dummy index.


What happens for a single property only? 


What does this mean?
You're querying two properties, aka two fields, in the Lucene query. What
if you just use skos:prefLabel?
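For example, the subquery reduced to a single field would be (a sketch; same pattern as yours, just dropping skos:altLabel):

    { SELECT ?s ?score WHERE {
        (?s ?score) text:query (skos:prefLabel "\"xx yy\"" "lang:en" ) .
      } ORDER BY DESC(?score) OFFSET 0 LIMIT 1000 }

If that returns exact counts, the shortfall presumably comes from the duplicate documents per subject mentioned above.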


Pagination should work as you're doing: the Lucene query is
internally executed once, then cached - for later requests the same
Lucene document hits should be reused.
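So, as a sketch, the next page would keep the text:query arguments identical and only shift the OFFSET:

    { SELECT ?s ?score WHERE {
        (?s ?score) text:query (skos:prefLabel skos:altLabel "\"xx yy\"" "lang:en" ) .
      } ORDER BY DESC(?score) OFFSET 1000 LIMIT 1000 }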


On 19.10.22 08:21, Mikael Pesonen wrote:


Hi,

yes, the same select run as the only query gets exactly the limit amount of triples.

On 18/10/2022 16.48, Lorenz Buehmann wrote:
did you get those results when running only this subquery? Afaik, 
the default limit of the Lucene text query is at most 10 000 
documents - and I don't think that the outer LIMIT would make it to 
the Lucene request



On 18.10.22 13:35, Mikael Pesonen wrote:


I have a bigger query that starts with inner select

 { SELECT ?s ?score WHERE {
    (?s ?score) text:query (skos:prefLabel skos:altLabel "\"xx 
yy\"" "lang:en" ) .

    } order by desc(?score) offset 0 limit 1000 }

There are about 1 results. limit 1000 returns ~560 and limit 
100 ~75 results. How do I page results correctly?






Re: TDB2 Data Doesn't Persist

2022-10-19 Thread Andy Seaborne



On 18/10/2022 23:11, theodore.hi...@morganstanley.com wrote:

Hi Andy,

I seem to have a configuration that's working, where my named graphs are
persisting. See the new config file below. However, it's horribly slow. I am
running Java code which is reading 60,000 records and running SPARQL queries
and updates via the REST API. It will eventually produce 5 million triples.
The further along it gets, the slower it runs. It's about 1/3 done now and
down to less than 1 update per second.

My ja:rdfsSchema file is large. It contains my full ontology and some reference
data, which is large (about 90K axioms). I'm not sure that's a factor,
however, since the slowdown seems to be based on the number of triples loaded.
But I still don't understand this schema stuff, since inferencing isn't going
to work with my query unless I've also loaded the full ontology into a named
graph that the query references. What really needs to be in the schema file?

Appreciate any help.


If you want only
  ?s rdfs:subClassOf+ ?c .

you can ask that in the SPARQL query - no inference needed - except it 
won't get reused across queries.


If you add ja:DatasetRDFS, you don't need, and shouldn't use, the "+".
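For illustration, the two query shapes would be (a sketch; prefixes assumed):

    # with plain TDB2 and no inference, use a property path:
    SELECT ?s ?c WHERE { ?s rdfs:subClassOf+ ?c }

    # with ja:DatasetRDFS, the subclass closure is already visible:
    SELECT ?s ?c WHERE { ?s rdfs:subClassOf ?c }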

> and some reference data

ja:DatasetRDFS assumes schema and data are separate.

It isn't a general RDFS reasoner - it assumes a fixed schema, and large 
data.


Andy



Thanks,
Ted

Theodore Hills
Consultant | Research
Phone: +1 212 296-1833
theodore.hi...@morganstanley.com

## Fuseki Server configuration file.

PREFIX fuseki: <http://jena.apache.org/fuseki#>
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
PREFIX tdb1:   <http://jena.hpl.hp.com/2008/tdb#>
PREFIX tdb2:   <http://jena.apache.org/2016/tdb#>
PREFIX ja:     <http://jena.hpl.hp.com/2005/11/Assembler#>
PREFIX :       <#>


<#service1> rdf:type fuseki:Service ;
 fuseki:name   "msro" ;   # http://host:port/msro
 fuseki:endpoint [ # http://host:port/msro/sparql?query=
 fuseki:operation fuseki:query ;
 fuseki:name "sparql"
 ] ;
 fuseki:endpoint [ # http://host:port/msro/query?query=
  # SPARQL query service (alt name)
 fuseki:operation fuseki:query ;
 fuseki:name "query"
 ] ;

 fuseki:endpoint [ # http://host:port/msro/update?update=
  # SPARQL update service
 fuseki:operation fuseki:update ;
 fuseki:name "update"
 ] ;

 fuseki:endpoint [
  # SPARQL Graph Store Protocol (read)
 fuseki:operation fuseki:gsp_r ;
 fuseki:name "get"
 ] ;
 fuseki:endpoint [
 # SPARQL Graph Store Protocol (read and write)
 # http://host:port/msro/data?default or 
http://host:port/msro/data?graph=
 fuseki:operation fuseki:gsp_rw ;
 fuseki:name "data"
 ] ;

 fuseki:dataset  <#dataset> ;
 .

<#dataset> rdf:type ja:DatasetRDFS ;
   ja:rdfsSchema  ;
   ja:dataset <#actualDataset> ;
   .

<#actualDataset> rdf:type tdb2:DatasetTDB2;  # for example.
  tdb2:location "C:\\Users\\hillsthe\\run\\data\\tdb2";
  .


-Original Message-
From: Hills, Theodore (Research)
Sent: Monday, October 17, 2022 6:02 PM
To: Andy Seaborne 
Cc: users@jena.apache.org
Subject: RE: TDB2 Data Doesn't Persist

Hi Andy,

Thanks again for the reply. This seems quite odd to me, so let me check my 
understanding.

When persistence is not an issue, RDFS inferencing without axioms works fine in 
a named graph--actually, across three named graphs. I use one of the named 
graphs for the vocabulary, and two for data.

But if I want my named graph to persist, and because I am using inferencing, I 
have to define my named graph in the assembler (config file).

Question: Do I need to put the named graph URL in the config file? If so, I 
can't tell from the documentation where to put that. Insights appreciated.

Thanks,
Ted

Theodore Hills
Consultant | Research
Phone: +1 212 296-1833
theodore.hi...@morganstanley.com
-Original Message-
From: Andy Seaborne 
Sent: Monday, October 17, 2022 4:25 PM
To: Hills, Theodore (Research) 
Cc: users@jena.apache.org
Subject: Re: TDB2 Data Doesn't Persist

Hi Theodore,

<#dataset> rdf:type ja:RDFDataset;
   ja:defaultGraph <#inferenceModel>
   .

defines a dataset with just a default graph. It does not pass down named
graphs. They end up in a temporary holder.

You can add the named graph to the ja:RDFDataset but you have to know its name
and define the inference. Inference is per-graph.

There is

https://jena.apache.org/documentation/rdfs/

# RDFS
:rdfsDataset rdf:type ja:DatasetRDFS ;
  ja:rdfsSchema ;
  ja:dataset --- some TDB2 database ---
  .

This applies RDFS - without axioms - so rdfs:subClassOf, rdfs:subPropertyOf,
rdfs:domain and rdfs:range - to all graphs in the ja:dataset.

The schema is fixed at startup.

  Andy

On 17/10/2022 14:39, theodore.hi...@morga

Re: TDB2 Data Doesn't Persist

2022-10-19 Thread Andy Seaborne



On 17/10/2022 23:02, theodore.hi...@morganstanley.com wrote:

Hi Andy,

Thanks again for the reply. This seems quite odd to me, so let me check my 
understanding.

When persistence is not an issue, RDFS inferencing without axioms works fine in 
a named graph--actually, across three named graphs. I use one of the named 
graphs for the vocabulary, and two for data.

But if I want my named graph to persist, and because I am using inferencing, I 
have to define my named graph in the assembler (config file).

Question: Do I need to put the named graph URL in the config file? If so, I 
can't tell from the documentation where to put that. Insights appreciated.


There's no difference - this defines a dataset with a default graph and
nothing else.


<#dataset> rdf:type ja:RDFDataset;
  ja:defaultGraph <#inferenceModel>
  .

<#inferenceModel>
... inference details ...

## Sketch
## Dataset with default graph and a named graph

<#dataset> rdf:type ja:RDFDataset;
  ja:defaultGraph <#inferenceModelDft>;
  ja:namedGraph [
 ja:graphName  ;
 ja:graph <#myInferenceNamedGraph> ] ;
  .

It might be clearer to use ja:graph in place of ja:defaultGraph - used 
like this, they do the same thing.


<#dataset> rdf:type ja:RDFDataset;
  ja:graph <#inferenceModelDft>;
  ja:namedGraph [
 ja:graphName  ;
 ja:graph <#myInferenceNamedGraph> ] ;
  .
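For completeness, a sketch of what the "... inference details ..." could be, using the TransitiveReasoner Theodore mentioned (the base graph resource <#baseNamedGraph> is a placeholder for a TDB2-backed graph, e.g. as in the linked example):

<#myInferenceNamedGraph> rdf:type ja:InfModel ;
    ja:reasoner [
        # registered URL of Jena's TransitiveReasoner
        ja:reasonerURL <http://jena.hpl.hp.com/2003/TransitiveReasoner>
    ] ;
    ja:baseModel <#baseNamedGraph> .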

Selected named graphs example:

https://github.com/apache/jena/blob/main/jena-fuseki2/examples/tdb2-select-graphs.ttl

The fact that the inference has to be repeated (because it will connect to a
different graph in TDB) is not ideal.


Andy




Thanks,
Ted

Theodore Hills
Consultant | Research
Phone: +1 212 296-1833
theodore.hi...@morganstanley.com
-Original Message-
From: Andy Seaborne 
Sent: Monday, October 17, 2022 4:25 PM
To: Hills, Theodore (Research) 
Cc: users@jena.apache.org
Subject: Re: TDB2 Data Doesn't Persist

Hi Theodore,

<#dataset> rdf:type ja:RDFDataset;
   ja:defaultGraph <#inferenceModel>
   .

defines a dataset with just a default graph. It does not pass down named
graphs. They end up in a temporary holder.

You can add the named graph to the ja:RDFDataset but you have to know its name
and define the inference. Inference is per-graph.

There is

https://jena.apache.org/documentation/rdfs/

# RDFS
:rdfsDataset rdf:type ja:DatasetRDFS ;
  ja:rdfsSchema ;
  ja:dataset --- some TDB2 database ---
  .

This applies RDFS - without axioms - so rdfs:subClassOf, rdfs:subPropertyOf,
rdfs:domain and rdfs:range - to all graphs in the ja:dataset.

The schema is fixed at startup.

  Andy

On 17/10/2022 14:39, theodore.hi...@morganstanley.com wrote:

Hi Andy,

Thank you for the prompt reply.

I deleted everything in my tdb2 directory in order to force Fuseki to recreate 
the database from scratch so I knew I was working from a clean slate.

I was able to duplicate your results, using my own D.ttl, which is below.

  
"dummy" .

I then loaded the dummy triple into a named graph using the curl command below.

curl -T D.ttl --header 'Content-type: text/turtle' \
  
'http://localhost:3030/msro/data?graph=https%3A%2F%2Ftriples.ms.com%2Fmsro%2F__dummy__'

I was able to dump the named graph using the command below.

curl 
'http://localhost:3030/msro/data?graph=https%3A%2F%2Ftriples.ms.com%2Fmsro%2F__dummy__'

I then stopped and restarted Fuseki, and the triple in the named graph was 
gone. The triple in the default graph was still there.

Also, I monitored the size of Data-0001\nodes-data.obj at every step. Upon
creation of a fresh database, it was 0 KB. It remained 0 KB even when I loaded my
dummy triple into a named graph. Only when I loaded my dummy triple into the
default graph did it grow to 1 KB. It remained 1 KB across restarts.

So, the problem seems to be that triples in the default graph persist, but 
triples in named graphs do not.

The only inferencing I need is subclass inferencing in order for a query of 
this form to work:
?s rdfs:subClassOf+ ?c .

Once I layered in the TransitiveReasoner in the config file, the above query 
worked fine in the midst of a complicated and well-tested query that runs fine 
on Stardog and MarkLogic. As mentioned, I was getting the desired results on 
Fuseki TDB2 as well. The only problem seems to be the lack of persistence for 
named graphs.

Thanks so much for your attention to this problem!

Theodore Hills
Consultant | Research
Phone: +1 212 296-1833
theodore.hi...@morganstanley.com
-Original Message-
From: Andy Seaborne 
Sent: Saturday, October 15, 2022 5:41 AM
To: users@jena.apache.org
Subject: Re: TDB2 Data Doesn't Persist

Hi Theodore,

I tried your configuration and got persisted data.

I tried:

curl -T D.ttl --header 'Content-type: text/turtle' \
   'http://localhost:3030/msro/data?defau

Re: SPARQL limit doesn't work

2022-10-19 Thread Mikael Pesonen





On 19/10/2022 10.18, Lorenz Buehmann wrote:
Honestly - probably because of lack of knowledge - I don't see how 
that can happen with the text index. You have a single triple pattern 
that is querying the Lucene index for the given pattern and returns by 
default at most 10 000 documents.



text:query (skos:prefLabel skos:altLabel "\"xx yy\"" "lang:en" )

translates to


( (prefLabel:"\"xx yy\"" OR altLabel:"\"xx yy\"") AND lang:en)
which indeed can return duplicate documents as for each triple a 
separate document is created and indexed.


I still don't get how a query that returns 560 with limit 1000 then
doesn't return 100 with limit 100


Currently, I find your results quite counterintuitive, but I still
have to learn a lot when using RDF, SPARQL and Jena.



Can you share some data please to reproduce?
Unfortunately I can't share the data. Of course, when I have time, I could
create a similar dummy index.


What happens for a single property only? 


What does this mean?

Pagination should work as you're doing: the Lucene query is internally
executed once, then cached - for later requests the same Lucene
document hits should be reused.


On 19.10.22 08:21, Mikael Pesonen wrote:


Hi,

yes, the same select run as the only query gets exactly the limit amount of triples.

On 18/10/2022 16.48, Lorenz Buehmann wrote:
did you get those results when running only this subquery? Afaik, 
the default limit of the Lucene text query is at most 10 000 
documents - and I don't think that the outer LIMIT would make it to 
the Lucene request



On 18.10.22 13:35, Mikael Pesonen wrote:


I have a bigger query that starts with inner select

 { SELECT ?s ?score WHERE {
    (?s ?score) text:query (skos:prefLabel skos:altLabel "\"xx 
yy\"" "lang:en" ) .

    } order by desc(?score) offset 0 limit 1000 }

There are about 1 results. limit 1000 returns ~560 and limit 
100 ~75 results. How do I page results correctly?




--
Lingsoft - 30 years of Leading Language Management

www.lingsoft.fi

Speech Applications - Language Management - Translation - Reader's and Writer's 
Tools - Text Tools - E-books and M-books

Mikael Pesonen
System Engineer

e-mail: mikael.peso...@lingsoft.fi
Tel. +358 2 279 3300

Time zone: GMT+2

Helsinki Office
Eteläranta 10
FI-00130 Helsinki
FINLAND

Turku Office
Kauppiaskatu 5 A
FI-20100 Turku
FINLAND



Re: fuseki backup process / policy - similar capabilities to autopostgresqlbackup ?

2022-10-19 Thread Ioan Eugen Stan
Hello, 

So I took some time to implement a program to do backups following a policy.
To implement such a program, I think it would be helpful to add the database
being backed up to the tasks JSON output.

Right now we get:

[ { 
"task" : "Backup" ,
"taskId" : "1" ,
"started" : "2022-10-11T16:25:47.083+00:00"
  }
]

From this I don't know which DB is being backed up. That information is
helpful when you have several tasks in progress, to tell which one is done and
which is still running.

Regarding the backup program, I was checking out how autopostgresql-backup works
in order to implement something similar.
autopostgresql-backup works synchronously, which keeps its logic simple.

On the Fuseki side, I need to know when the task is done.
Since the tasks API is async, my plan is to poll the tasks API and check for the DB name.
I can also use the DB name + date from the JSON reply to form the file name instead of
parsing it.
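As a sketch, that trigger-and-poll flow with curl against the admin endpoints could look like this (database name "mydb" and task id "1" are placeholders):

    # start an asynchronous backup; the JSON reply contains the taskId
    curl -XPOST 'http://localhost:3030/$/backup/mydb'

    # poll the task until its JSON gains a "finished" timestamp
    curl 'http://localhost:3030/$/tasks/1'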

Let me know if you have other ideas on how this should be done. 


Re: Re: SPARQL limit doesn't work

2022-10-19 Thread Lorenz Buehmann
Honestly - probably because of lack of knowledge - I don't see how that 
can happen with the text index. You have a single triple pattern that is 
querying the Lucene index for the given pattern and returns by default 
at most 10 000 documents.



text:query (skos:prefLabel skos:altLabel "\"xx yy\"" "lang:en" )

translates to


( (prefLabel:"\"xx yy\"" OR altLabel:"\"xx yy\"") AND lang:en)
which indeed can return duplicate documents as for each triple a 
separate document is created and indexed.


I still don't get how a query that returns 560 with limit 1000 then doesn't
return 100 with limit 100


Currently, I find your results quite counterintuitive, but I still have
to learn a lot when using RDF, SPARQL and Jena.



Can you share some data please to reproduce?

What happens for a single property only? Pagination should work as
you're doing: the Lucene query is internally executed once, then cached
- for later requests the same Lucene document hits should be reused


On 19.10.22 08:21, Mikael Pesonen wrote:


Hi,

yes, the same select run as the only query gets exactly the limit amount of triples.

On 18/10/2022 16.48, Lorenz Buehmann wrote:
did you get those results when running only this subquery? Afaik, the 
default limit of the Lucene text query is at most 10 000 documents - 
and I don't think that the outer LIMIT would make it to the Lucene 
request



On 18.10.22 13:35, Mikael Pesonen wrote:


I have a bigger query that starts with inner select

 { SELECT ?s ?score WHERE {
    (?s ?score) text:query (skos:prefLabel skos:altLabel "\"xx yy\"" 
"lang:en" ) .

    } order by desc(?score) offset 0 limit 1000 }

There are about 1 results. limit 1000 returns ~560 and limit 100 
~75 results. How do I page results correctly?