Re: Counting all triples performance and named graphs

2022-09-21 Thread Andy Seaborne




On 21/09/2022 08:57, Simon Bin wrote:
> Hi,
>
> we have a data set with 500 million triples. Single named graph, fuseki
> tdb2.
>
> We observe different performance between
>
> select (count(*) as ?cnt) {
>   ?s ?p ?o
> }
>
> ~4 minutes
>
> select (count(*) as ?cnt) {
>   graph  { ?s ?p ?o }
> }
>
> ~3 minutes

Caching.

>
> select (count(*) as ?cnt)
> from  {
>   ?s ?p ?o
> }
>
> takes forever and longer...?
>
> Especially the last case is surprising, any thoughts?

If you use FROM, a dataset is created for the request. It has one graph
that points to the TDB database, but it isn't itself TDB. Every ?s ?p ?o
is read because access through that graph materializes triples.


If it's direct to TDB, and that includes GRAPH, the count counts rows 
but does not actually fetch the values of ?s ?p ?o.


BindingTDB/BindingNodeId do lazy evaluation of variable values.
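
A minimal Python sketch of the lazy-evaluation idea described above, not Jena's actual code. The names `NodeTable`, `LazyBinding`, `scan`, `count_rows` and `materialise` are hypothetical and exist only for this illustration. Rows carry integer node ids; counting rows never looks a value up, whereas building full triples for every row does.

```python
class NodeTable:
    """Maps integer node ids to RDF terms and counts how many lookups occur."""

    def __init__(self, terms):
        self._terms = list(terms)
        self.lookups = 0

    def fetch(self, node_id):
        self.lookups += 1
        return self._terms[node_id]


class LazyBinding:
    """One solution row: holds node ids and resolves a term only when asked."""

    def __init__(self, table, ids):
        self._table = table
        self._ids = ids  # e.g. {"s": 0, "p": 1, "o": 2}

    def get(self, var):
        return self._table.fetch(self._ids[var])


def scan(table, id_triples):
    """Yield one lazy binding per stored triple."""
    for s, p, o in id_triples:
        yield LazyBinding(table, {"s": s, "p": p, "o": o})


def count_rows(bindings):
    """COUNT(*)-style aggregation: counts rows and never reads a value."""
    return sum(1 for _ in bindings)


def materialise(bindings):
    """Builds a full triple for every row, so every value is fetched."""
    return [(b.get("s"), b.get("p"), b.get("o")) for b in bindings]


table = NodeTable([":a", ":knows", ":b", ":c"])
id_triples = [(0, 1, 2), (0, 1, 3), (2, 1, 3)]

n = count_rows(scan(table, id_triples))
lookups_for_count = table.lookups

triples = materialise(scan(table, id_triples))
lookups_for_triples = table.lookups - lookups_for_count
```

In this toy model the count touches the node table zero times, while materialising costs three lookups per row. That is the kind of difference that would add up over 500 million triples.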

There can be multiple FROM clauses - the machinery is general and has to cope with that.

FROM != GRAPH.

Andy


Re: Persist SHACL shapes in dataset possible?

2022-09-21 Thread Sebastian Faubel
Hi Andy, Hi Lorenz,

Thank you both for your quick responses! Regarding your question, Andy:

The shapes graph contains a bit less than 3M triples, but will eventually
grow in size as we add validation of measurement units. Executing a
validation using the API currently takes around 18s which is quite
impressive. Also, that you support SHACL SPARQL is really cool.

It would be perfectly fine for us to have the shapes stored in a named
graph in the same Dataset.

Btw. Does TDB performance scale per dataset (my guess) or per graph?

~Sebastian

*Semiodesk GmbH | *Werner-von-Siemens-Str. 6 Geb. 15k, 86159 Augsburg,
Germany | Phone: +49 821 8854401 | Fax: +49 821 8854410 | www.semiodesk.com


This e-mail message may contain confidential or legally privileged
information and is intended only for the use of the intended recipient(s).
Any unauthorized disclosure, dissemination, distribution, copying or the
taking of any action in reliance on the information herein is prohibited.
E-mails are not secure and cannot be guaranteed to be error free as they
can be intercepted, amended, or contain viruses. Anyone who communicates
with us by e-mail is deemed to have accepted these risks. Semiodesk GmbH is
not responsible for errors or omissions in this message and denies any
responsibility for any damage arising from the use of e-mail. Any opinion
and other statement contained in this message and any attachment are solely
those of the author and do not necessarily represent those of the company.


On Wed, Sep 21, 2022 at 11:07 AM Andy Seaborne  wrote:

>
>
> On 21/09/2022 08:00, Lorenz Buehmann wrote:
> > Interesting question. I think currently the Fuseki SHACL service expects
> > a Turtle file for the shapes according to [1]. I agree that this seems to
> > be rather inefficient.
>
> @Sebastian - how much data is there? (in triples ideally)
> What does 300mb translate into in terms of numbers of triples and shapes?
>
> How long (locally from a file) does it take to parse the shapes?
>
> >
> > @Andy:
> >
> > - the code expects only Turtle format, right? RDF/XML would fail to
> > parse the shapes then, correct?
>
> No. The Content-Type header is used - if that fails, the default lang
> provided is used.
>
> > - the shapes are parsed into a Graph, wouldn't it be possible to reuse a
> > named graph of the backend dataset containing the shapes?
>
> Possible to implement, yes.
>
> Do you mean the same dataset as the data? (that is, using the dataset as
> a collection of graphs). Or a different dataset for storing shapes
> separately from the data (maybe default union graph is being used on the
> data).
>
> Both are possible, the first is easier. The second gets into naming
> named graphs in other datasets.
> "http://example/dataset#encoded-named-graph-uri" could be used (does
> not work ATM - it needs implementing.)
>
>  > Or would it be
>  > too slow to use shapes from e.g. a TDB backend?
>
> Jena SHACL parses the shapes and then executes from an in-memory
> data structure (a bit more than the AST but not much) so execution is not
> dipping in and out of the database. It should be fast enough.
>
> The parsing is cacheable.  Cache invalidation is hard if it's a
> different dataset.
>
>
> This would be a good example of adding more capabilities by using a
> Fuseki Module:
>
> https://jena.apache.org/documentation/fuseki2/fuseki-modules
>
> The idea is to make the fuseki-server jar
> (1) Fuseki main
> (2) A set of modules (for now, admin, UI; later - open ended set)
>
> and the user can choose the server functionality by the selection of
> modules.
>
>  Andy
>
> >
> >
> > Cheers,
> >
> > Lorenz
> >
> >
> > [1]
> > https://github.com/apache/jena/blob/main/jena-fuseki2/jena-fuseki-core/src/main/java/org/apache/jena/fuseki/servlets/SHACL_Validation.java#L66
> >
> >
> > On 20.09.22 17:22, Sebastian Faubel wrote:
> >> Hello everyone,
> >>
> >> I am using Jena Fuseki 4.6.1 with a dataset that I want to validate using
> >> SHACL. I've seen the documentation on the SHACL feature in Apache Jena
> >> Fuseki here:
> >>
> >> https://jena.apache.org/documentation/shacl/
> >>
> >> My issue is that my SHACL shapes graph has around 300mb. Uploading this
> >> every time I want to validate would be pretty slow and inefficient. I was
> >> wondering if it is possible to persist the shapes graph in the dataset
> >> somehow?
> >>
> >> Thank you! :)
> >>
> >> ~Sebastian

Re: Weird sparql problem

2022-09-21 Thread Mikael Pesonen



Fresh start of the server didn't help. I'll try a fresh 4.6.1 install
in a few days.


BR

On 21/09/2022 9.15, Lorenz Buehmann wrote:
> Weird, only 10M triples and each triple pattern returns only 1
> binding, thus, the size is tiny - honestly I can't think of anything
> except for open connections, but as you mentioned, running the queries
> with only one triple pattern works as expected, so too many open
> connections most likely shouldn't be the issue.
>
> Can you reproduce this behavior with newer Jena versions like 4.6.1?
>
> Or can you reproduce this on different servers as well?
>
> Is it also stuck if you run the query directly after you restart Fuseki?
>
> On 19.09.22 13:49, Mikael Pesonen wrote:
>>
>> On 15/09/2022 17.48, Lorenz Buehmann wrote:
>>> Forgot:
>>>
>>> - size of result for each triple pattern? Might affect if hash join
>>> can be used.
>>
>> It's one row for each.
>>
>>> - your hardware?
>>
>> Normal server with 16gigs mem.
>>
>>> - is it just the first query after starting Fuseki? Connections have
>>> been closed? Note, there was also a bug in a recent Jena version,
>>> but only with TDB and too many open connections. It has been
>>> resolved with release 4.6.1.
>>
>> Jena has been running quite a while.
>>
>>> Might not be related, but I'm mentioning all things here nevertheless.
>>>
>>> On 15.09.22 11:16, Mikael Pesonen wrote:
>>>>
>>>> This returns one row fast, say :C1
>>>>
>>>> SELECT *
>>>> FROM
>>>> WHERE {
>>>>    a ?t .
>>>>   #?t skos:prefLabel ?l
>>>> }
>>>>
>>>> and this too:
>>>>
>>>> SELECT *
>>>> FROM
>>>> WHERE {
>>>>   # a ?t .
>>>>   :C1 skos:prefLabel ?l
>>>> }
>>>>
>>>> But this always hangs until timeout
>>>>
>>>> SELECT *
>>>> FROM
>>>> WHERE {
>>>>    a ?t .
>>>>   ?t skos:prefLabel ?l
>>>> }
>>>>
>>>> What am I missing here? I'm using Fuseki web GUI. Thanks!




--
Lingsoft - 30 years of Leading Language Management

www.lingsoft.fi

Speech Applications - Language Management - Translation - Reader's and Writer's 
Tools - Text Tools - E-books and M-books

Mikael Pesonen
System Engineer

e-mail: mikael.peso...@lingsoft.fi
Tel. +358 2 279 3300

Time zone: GMT+2

Helsinki Office
Eteläranta 10
FI-00130 Helsinki
FINLAND

Turku Office
Kauppiaskatu 5 A
FI-20100 Turku
FINLAND



Re: Persist SHACL shapes in dataset possible?

2022-09-21 Thread Andy Seaborne




On 21/09/2022 08:00, Lorenz Buehmann wrote:
> Interesting question. I think currently the Fuseki SHACL service expects
> a Turtle file for the shapes according to [1]. I agree that this seems to
> be rather inefficient.


@Sebastian - how much data is there? (in triples ideally)
What does 300mb translate into in terms of numbers of triples and shapes?

How long (locally from a file) does it take to parse the shapes?



> @Andy:
>
> - the code expects only Turtle format, right? RDF/XML would fail to
> parse the shapes then, correct?


No. The Content-Type header is used - if that fails, the default lang 
provided is used.
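
A small Python sketch of that fallback, as an illustration only and not the Fuseki servlet code. The function name `choose_syntax`, the `KNOWN_TYPES` table and the Turtle default are assumptions made for the example; the media types themselves are the registered ones for these RDF syntaxes.

```python
# Hypothetical lookup table: registered media type -> RDF syntax name.
KNOWN_TYPES = {
    "text/turtle": "Turtle",
    "application/rdf+xml": "RDF/XML",
    "application/n-triples": "N-Triples",
    "application/ld+json": "JSON-LD",
}


def choose_syntax(content_type, default="Turtle"):
    """Pick the parser syntax from a Content-Type header value.

    Falls back to `default` when the header is missing or not recognised.
    """
    if content_type:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type in KNOWN_TYPES:
            return KNOWN_TYPES[media_type]
    return default
```

Under these assumptions, a shapes upload sent as `application/rdf+xml` would be parsed as RDF/XML, and only a missing or unrecognised header would fall through to the default.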


> - the shapes are parsed into a Graph, wouldn't it be possible to reuse a
> named graph of the backend dataset containing the shapes?


Possible to implement, yes.

Do you mean the same dataset as the data? (that is, using the dataset as 
a collection of graphs). Or a different dataset for storing shapes 
separately from the data (maybe default union graph is being used on the 
data).


Both are possible, the first is easier. The second gets into naming
named graphs in other datasets.
"http://example/dataset#encoded-named-graph-uri" could be used (does
not work ATM - it needs implementing.)


> Or would it be
> too slow to use shapes from e.g. a TDB backend?

Jena SHACL parses the shapes and then executes from an in-memory
data structure (a bit more than the AST but not much) so execution is not
dipping in and out of the database. It should be fast enough.


The parsing is cacheable.  Cache invalidation is hard if it's a 
different dataset.
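
One way to picture the caching point is the Python sketch below. It is a toy model, not Jena code: `ShapesStore`, `ParsedShapesCache` and the `version` counter are hypothetical, and it assumes the cache can cheaply see when the store holding the shapes has changed. That assumption is the easy case of shapes kept in the same dataset; with a separate dataset, learning about changes is the hard part.

```python
class ShapesStore:
    """Stand-in for a dataset holding shapes graphs; bumps a version on every write."""

    def __init__(self):
        self._graphs = {}
        self.version = 0

    def put(self, name, text):
        self._graphs[name] = text
        self.version += 1

    def get(self, name):
        return self._graphs[name]


class ParsedShapesCache:
    """Keeps parsed shapes per graph name and re-parses after the store changes."""

    def __init__(self, store, parse):
        self._store = store
        self._parse = parse
        self._entries = {}  # graph name -> (store version, parsed shapes)
        self.parse_calls = 0

    def shapes(self, name):
        entry = self._entries.get(name)
        if entry is not None and entry[0] == self._store.version:
            return entry[1]  # still valid: no parsing needed
        self.parse_calls += 1
        parsed = self._parse(self._store.get(name))
        self._entries[name] = (self._store.version, parsed)
        return parsed


store = ShapesStore()
store.put("urn:example:shapes", "shape-a\nshape-b")
cache = ParsedShapesCache(store, parse=lambda text: text.splitlines())

first = cache.shapes("urn:example:shapes")
second = cache.shapes("urn:example:shapes")  # served from the cache
store.put("urn:example:shapes", "shape-a\nshape-b\nshape-c")
third = cache.shapes("urn:example:shapes")   # version changed, so re-parsed
```

The whole-store version counter is deliberately coarse: any write invalidates every cached entry, which keeps the sketch correct at the cost of some unnecessary re-parsing.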



This would be a good example of adding more capabilities by using a 
Fuseki Module:


https://jena.apache.org/documentation/fuseki2/fuseki-modules

The idea is to make the fuseki-server jar
(1) Fuseki main
(2) A set of modules (for now, admin, UI; later - open ended set)

and the user can choose the server functionality by the selection of 
modules.


Andy




> Cheers,
>
> Lorenz
>
>
> [1]
> https://github.com/apache/jena/blob/main/jena-fuseki2/jena-fuseki-core/src/main/java/org/apache/jena/fuseki/servlets/SHACL_Validation.java#L66



> On 20.09.22 17:22, Sebastian Faubel wrote:
>> Hello everyone,
>>
>> I am using Jena Fuseki 4.6.1 with a dataset that I want to validate using
>> SHACL. I've seen the documentation on the SHACL feature in Apache Jena
>> Fuseki here:
>>
>> https://jena.apache.org/documentation/shacl/
>>
>> My issue is that my SHACL shapes graph has around 300mb. Uploading this
>> every time I want to validate would be pretty slow and inefficient. I was
>> wondering if it is possible to persist the shapes graph in the dataset
>> somehow?
>>
>> Thank you! :)
>>
>> ~Sebastian





Counting all triples performance and named graphs

2022-09-21 Thread Simon Bin
Hi,

we have a data set with 500 million triples. Single named graph, fuseki
tdb2. 

We observe different performance between

select (count(*) as ?cnt) {
  ?s ?p ?o
}

~4 minutes

select (count(*) as ?cnt) {
  graph  { ?s ?p ?o }
}

~3 minutes

select (count(*) as ?cnt)
from  {
  ?s ?p ?o
}

takes forever and longer...? 

Especially the last case is surprising, any thoughts?


Re: Persist SHACL shapes in dataset possible?

2022-09-21 Thread Lorenz Buehmann
Interesting question. I think currently the Fuseki SHACL service expects 
a Turtle file for the shapes according to [1]. I agree that this seems to
be rather inefficient.


@Andy:

- the code expects only Turtle format, right? RDF/XML would fail to 
parse the shapes then, correct?


- the shapes are parsed into a Graph, wouldn't it be possible to reuse a 
named graph of the backend dataset containing the shapes? Or would it be
too slow to use shapes from e.g. a TDB backend?



Cheers,

Lorenz


[1] 
https://github.com/apache/jena/blob/main/jena-fuseki2/jena-fuseki-core/src/main/java/org/apache/jena/fuseki/servlets/SHACL_Validation.java#L66


On 20.09.22 17:22, Sebastian Faubel wrote:

> Hello everyone,
>
> I am using Jena Fuseki 4.6.1 with a dataset that I want to validate using
> SHACL. I've seen the documentation on the SHACL feature in Apache Jena
> Fuseki here:
>
> https://jena.apache.org/documentation/shacl/
>
> My issue is that my SHACL shapes graph has around 300mb. Uploading this
> every time I want to validate would be pretty slow and inefficient. I was
> wondering if it is possible to persist the shapes graph in the dataset
> somehow?
>
> Thank you! :)
>
> ~Sebastian
