Re: Aborting UPDATEs

2024-05-19 Thread Andy Seaborne




On 17/05/2024 17:22, Holger Knublauch wrote:

Hi all,

am I missing something obvious or is it not yet possible to programmatically 
abort SPARQL UPDATEs, like QueryExecutions can?


No, you aren't missing anything. Updates don't have a timeout.

Probably they could have nowadays.

There is a requirement on the dataset target of the update - it must 
support a proper abort.


An Update can be several operations, separated by ";"  and each 
operation is completed before the next is attempted.


A timeout ought to abort the whole update; otherwise, at best, the data 
is partially updated and, at worst (non-transactional usage), Java data 
structures may be corrupted.


Most datasets support a good enough abort but not a general dataset with 
links to arbitrary graphs. A buffering dataset can be used for a good 
enough abort.
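
For contrast, query executions already have timeout and abort controls. A 
minimal sketch of the query-side API (the dataset and query strings are 
placeholders; the update call is shown only to contrast, it has no 
timeout/abort builder):

```
import java.util.concurrent.TimeUnit;

import org.apache.jena.query.Dataset;
import org.apache.jena.query.DatasetFactory;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.update.UpdateAction;

public class AbortAndTimeout {
    public static void main(String[] args) {
        Dataset dataset = DatasetFactory.createTxnMem();

        // Queries: a timeout can be set on the builder, and qExec.abort()
        // may be called from another thread while the query is running.
        try (QueryExecution qExec = QueryExecution.dataset(dataset)
                .query("SELECT * { ?s ?p ?o }")
                .timeout(30, TimeUnit.SECONDS)
                .build()) {
            qExec.execSelect().forEachRemaining(System.out::println);
        }

        // Updates: no equivalent timeout/abort control (the point of this thread).
        UpdateAction.parseExecute("INSERT DATA { <urn:ex:s> <urn:ex:p> 1 }", dataset);
    }
}
```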




Related, I also didn't see a way to set a timeout.

I guess for our use cases it would be sufficient if the abort would happen 
during the WHERE clause iteration...

>

Thanks
Holger



Andy


Re: Fuseki: multiple instances vs. multiple datasets

2024-05-13 Thread Andy Seaborne




On 13/05/2024 11:10, Martynas Jusevičius wrote:

Hi,

I'm using multiple Fuseki instances in a Docker setup but considering
to use a single instance with multiple datasets instead.

So I was wondering what the differences of those setups are (besides
the lower memory consumption etc.) 


which is not so great, because the dominant memory cost is the database.


in terms of:
- security - I suppose there would be no difference since the datasets
are isolated and have separate endpoints?


If the docker setup is multi-machine, there is isolation of 
denial-of-service issues.



- federation - would SPARQL federation perform better on a single
instance? E.g. if a query federates between datasets on the same
instance, maybe Fuseki would recognize that and avoid HTTP calls? Just
thinking out loud here.
- any other aspects?


Administration convenience - which could go either way.

Load balancing.



Martynas



Andy


Re: Cannot get Fuseki 5 to run...

2024-05-02 Thread Andy Seaborne

Hi Phil,

It's a bug.

Fuseki uses the CORS filter from Eclipse Jetty by code-copy so as not to 
depend on Jetty. But at the last update, some Jetty code usage didn't 
get replaced, so there are still references to Jetty classes.


Issue created:
https://github.com/apache/jena/issues/2443

Andy

On 02/05/2024 04:02, Phillip Rhodes wrote:

Gang:

I'm having NO luck at all getting Fuseki 5 to run. I'm using Java 17
and the latest Tomcat 10 release that I see (apache-tomcat-10.1.23)
and Fuseki "jena-fuseki-war-5.0.0.war". From what I could find of docs
I thought this combination was sufficient, but apparently not. When I
try to launch the server I get this:

02-May-2024 02:56:46.903 SEVERE [main]
org.apache.catalina.startup.HostConfig.deployWAR Error deploying web
application archive
[/extradata/downloads/tomcat/apache-tomcat-10.1.23/webapps/fuseki.war]
java.lang.IllegalStateException: Error starting child
at
org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:690)
at
org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:659)
at
org.apache.catalina.core.StandardHost.addChild(StandardHost.java:712)
at
org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:969)
at
org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1911)
at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
org.apache.tomcat.util.threads.InlineExecutorService.execute(InlineExecutorService.java:75)
at
java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:123)
at
org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:771)
at
org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:423)
at
org.apache.catalina.startup.HostConfig.start(HostConfig.java:1629)
at
org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:303)
at
org.apache.catalina.util.LifecycleBase.fireLifecycleEvent(LifecycleBase.java:114)
at
org.apache.catalina.util.LifecycleBase.setStateInternal(LifecycleBase.java:402)
at
org.apache.catalina.util.LifecycleBase.setState(LifecycleBase.java:345)
at
org.apache.catalina.core.ContainerBase.startInternal(ContainerBase.java:903)
at
org.apache.catalina.core.StandardHost.startInternal(StandardHost.java:845)
at
org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:171)
at
org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1345)
at
org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1335)
at
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
org.apache.tomcat.util.threads.InlineExecutorService.execute(InlineExecutorService.java:75)
at
java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:145)
at
org.apache.catalina.core.ContainerBase.startInternal(ContainerBase.java:876)
at
org.apache.catalina.core.StandardEngine.startInternal(StandardEngine.java:240)
at
org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:171)
at
org.apache.catalina.core.StandardService.startInternal(StandardService.java:470)
at
org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:171)
at
org.apache.catalina.core.StandardServer.startInternal(StandardServer.java:947)
at
org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:171)
at org.apache.catalina.startup.Catalina.start(Catalina.java:757)
at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
Method)
at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at
org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:345)
at 
org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:473)
Caused by: org.apache.catalina.LifecycleException: Failed to
start component
[StandardEngine[Catalina].StandardHost[localhost].StandardContext[/fuseki]]
at
org.apache.catalina.util.LifecycleBase.handleSubClassException(LifecycleBase.java:419)
at
org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:186)
at

TDB3

2024-04-25 Thread Andy Seaborne




On 24/04/2024 21:42, Martynas Jusevičius wrote:

Andy,

Not directly related, but would different storage backend address
issues like this?

It might sound a bit like the legacy SDB, but AFAIK oxigraph, Stardog
and another commercial triplestore use RocksDB for storage.
https://github.com/oxigraph/oxigraph
https://docs.stardog.com/operating-stardog/database-administration/storage-optimize

There is even a RocksDB backend for Jena:
https://github.com/zourzouvillys/triplerocks
And just now I found your own TDB3 repo: https://github.com/afs/TDB3

Can you shed some light on TDB3 and this approach in general?


TDB3 uses RocksDB as the storage layer, replacing the custom B+trees and 
also the node table. It's a naive use of RocksDB. It seems to work 
(it's functional), but it's untested both in code and in deployment.


It loads more slowly than the TDB2 bulk loaders (IIRC maybe 70K triples/s), 
but little work has been done to exploit RocksDB's capabilities.


The advantage of RocksDB is that it is likely to be around for a long time 
(= it's a safe investment), it's transactional, it has compression [1] and 
compaction [2], and it has a Java wrapper (maintained separately, but 
closely related to and in contact with the RocksDB team).


While there are many storage engines that claim to be faster than 
RocksDB, such claims often come with assumptions.


There are other storage layers to explore as well.

Andy

[1] Better, or also, would probably be compression in the encoding of 
stored tuples.


[2] Compaction has two parts: finding the RDF terms that are currently 
in use in the database, and recovering space in the indexes. RocksDB 
compaction addresses the second case.





Martynas

On Wed, Apr 24, 2024 at 10:30 PM Andy Seaborne  wrote:


Hi Balduin,

Thanks for the detailed report. It's useful to hear of the use cases that
occur and also the behaviour of specific deployments.

On 22/04/2024 16:22, Balduin Landolt wrote:

Hello,

we're running Fuseki 5.0.0 (but previously the last 4.x versions behaved
essentially the same) with roughly 40 Mio triples (tendency: growing).
Not sure what configuration is relevant, but we have the default graph as
the union graph.


Sort of relevant.

There are more indexes on named graphs so there is more compaction work
to be done.

"union default graph" is a view at query time, not in the storage itself.


Also, we use Fuseki as our main database, not just as a "view on our data"
so we do quite a bit of updating on the data all the time.

Lately, we've been having more and more issues with servers running out of
disk space because Fuseki's database grew pretty rapidly.
This can be solved by compacting the DB, but with our data and hardware
this takes ca. 15 minutes, during which Fuseki does not accept any update
queries, so for the production system we can't really do this outside of
nighttime hours when (hopefully) no one uses the system anyways.


Is the database disk area on an SSD, on a hard disk, or a remote
filesystem (and then, is it SSD or hard disk)?


Some things we've noticed:
- A subset of our data (I think ~20 Mio triples) taking up 6GB in compacted
state, when dumped to a .trig file is ca. 5GB. But when uploading the same
.trig file to an empty DB, this grows to ca. 25GB
- Dropping graphs does not free up disk space


That's at the point the graph is dropped? It should reclaim space at
compaction.


- A sequence of e.g. 10k queries updating only a small number of triples
(maybe 1-10 or so) on the full dataset seems to grow the DB size a lot,
like 10s to 100s of GB (I don't have numbers on this one, but it was
substantial).


This might be a factor. There is a space overhead per transaction, not
solely due to the size of the update. It sounds like 10k updates is making
that appreciable.

Are the updates all additions? Or a mix of additions and deletions?


My question is:



Would that kind of growth in disk usage be expected?


Given 10K updates, then what you describe sounds possible.

> Are other people having similar issues?
> Are there strategies to mitigate this?

Batching the updates, although this does mean the updates don't
immediately appear in the database.

This can work reasonably when the updates are additions. If there are
deletes, it's harder.


Maybe some configuration that may be tweaked or so?


Sorry - there aren't any controls.



Best & thanks in advance,
Balduin



  Andy


Re: rdf:parseType="literal" vs. rdf:datatype="...XMLLiteral"

2024-04-25 Thread Andy Seaborne




On 25/04/2024 07:58, Thomas Francart wrote:

Hello Andy

Le lun. 22 avr. 2024 à 21:03, Andy Seaborne  a écrit :



On 22/04/2024 08:02, Thomas Francart wrote:

Hello

This is 3.17.0. Pretty old, due to other dependency with TopQuadrant

SHACL

API.


It's not perfect in 5.0.0 either.

TopQuadrant SHACL is now 4.10.0. It would be good to upgrade because of
XML security issue fixes around Jena 4.3.2.


It is being rejected because it is not legal RDF 1.0. At 1.0, the
lexical space had restrictions (XML exclusive canonicalization), where a
self-closing empty-element tag is not allowed. It has to be written as an
explicit start/end tag pair -- there are various other rules as well.



Thank you, I wasn't aware of this.


Nor was I until I checked!

The RDF 1.0 rules are quite confusing and this comes up from time to 
time. At the time there was no DOM standard, so there was no way to have 
a defined value space other than strings, and the lexical form was 
restricted by exclusive canonicalization.


(The DOM wasn't standardized at the time of RDF 1.0 IIRC)

The definition of rdf:XMLLiteral changed at RDF 1.1 to one where any XML
document fragment string is valid.

Seems not all places got updated. Partially, that is because it was
depending on the specific implementation of the Jena RDF/XML parser.

https://github.com/apache/jena/issues/2430


Do you happen to have the SPARQL queries? That part of your report is
related to the value space of RDF XML Literals.



Yes, the query is using the "=" operator :


OK - that will get fixed with

https://github.com/apache/jena/issues/2430



ask {
   ?uri a <http://exemple.com/MyClass> .
   ?uri <http://exemple.com/MyProperty> ?x, ?y.
   filter (?x != ?y)
}


This fails because use in the filter requires the value, and the value is 
undefined.




But then using the sameTerm function we don't get the error:

ask {
   ?uri a <http://exemple.com/MyClass> .
   ?uri <http://exemple.com/MyProperty> ?x, ?y.
   FILTER ( !sameTerm(?x, ?y) )
}




A proper update to RDF 1.1 may change the value object class (it is
"string" for RDF 1.0; it is, by the spec, DocumentFragment for RDF 1.1;
it could be kept as the document fragment's toString() in Jena. I'd like to
understand the usage to see which change is best).

  Andy

BTW It's rdf:parseType="Literal" -- Jena 5.0.0 is not tolerant of lower
case "literal"


And that can be put back to be tolerant on input.
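
On the original question of creating rdf:XMLLiteral literals explicitly: 
that can be done through the API rather than in RDF/XML syntax. A minimal 
sketch, looking the datatype up by URI; whether a given lexical form is 
reported as valid depends on the Jena version, per the discussion above:

```
import org.apache.jena.datatypes.RDFDatatype;
import org.apache.jena.datatypes.TypeMapper;
import org.apache.jena.rdf.model.Literal;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class XmlLiteralExample {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();

        // Look up rdf:XMLLiteral by its datatype URI.
        RDFDatatype xmlLiteral = TypeMapper.getInstance()
                .getSafeTypeByName("http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral");

        // RDF 1.1 allows any XML document fragment string as the lexical form;
        // the RDF 1.0 rules additionally required exclusive canonical XML.
        Literal lit = model.createTypedLiteral(
                "<reference><symbol href=\"https://xx.xx.xx/PC\"></symbol></reference>",
                xmlLiteral);

        System.out.println(lit.getDatatypeURI());
        System.out.println(lit.isWellFormedXML());
    }
}
```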





Thanks !

Thomas


Andy


Re: Java 21 support for Jena Fuseki 5.0.0

2024-04-24 Thread Andy Seaborne

The wording has been changed to
"Jena5 requires Java 17, or a later version of Java."

Thanks
Andy

On 24/04/2024 09:45, Balduin Landolt wrote:

Hi list,

me again... Does Jena Fuseki 5.0.0 support Java 21?
On https://jena.apache.org/download/ all I can see is "Jena5 requires Java
17".

Best,
Balduin



Re: Fuseki growing in size and need for compaction

2024-04-24 Thread Andy Seaborne

Hi Balduin,

Thanks for the detailed report. It's useful to hear of the use cases that 
occur and also the behaviour of specific deployments.


On 22/04/2024 16:22, Balduin Landolt wrote:

Hello,

we're running Fuseki 5.0.0 (but previously the last 4.x versions behaved
essentially the same) with roughly 40 Mio triples (tendency: growing).
Not sure what configuration is relevant, but we have the default graph as
the union graph.


Sort of relevant.

There are more indexes on named graphs so there is more compaction work 
to be done.


"union default graph" is a view at query time, not in the storage itself.


Also, we use Fuseki as our main database, not just as a "view on our data"
so we do quite a bit of updating on the data all the time.

Lately, we've been having more and more issues with servers running out of
disk space because Fuseki's database grew pretty rapidly.
This can be solved by compacting the DB, but with our data and hardware
this takes ca. 15 minutes, during which Fuseki does not accept any update
queries, so for the production system we can't really do this outside of
nighttime hours when (hopefully) no one uses the system anyways.


Is the database disk area on an SSD, on a hard disk, or a remote 
filesystem (and then, is it SSD or hard disk)?



Some things we've noticed:
- A subset of our data (I think ~20 Mio triples) taking up 6GB in compacted
state, when dumped to a .trig file is ca. 5GB. But when uploading the same
.trig file to an empty DB, this grows to ca. 25GB
- Dropping graphs does not free up disk space


That's at the point the graph is dropped? It should reclaim space at 
compaction.



- A sequence of e.g. 10k queries updating only a small number of triples
(maybe 1-10 or so) on the full dataset seems to grow the DB size a lot,
like 10s to 100s of GB (I don't have numbers on this one, but it was
substantial).


This might be a factor. There is a space overhead per transaction, not 
solely due to the size of the update. It sounds like 10k updates is making 
that appreciable.


Are the updates all additions? Or a mix of additions and deletions?


My question is:


Would that kind of growth in disk usage be expected? 


Given 10K updates, then what you describe sounds possible.

> Are other people having similar issues?
> Are there strategies to mitigate this?

Batching the updates, although this does mean the updates don't 
immediately appear in the database.


This can work reasonably when the updates are additions. If there are 
deletes, it's harder.
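
A minimal sketch of that batching idea: accumulate the small changes into 
one multi-operation update request and send it in a single call (the 
endpoint URL and data are placeholders):

```
import org.apache.jena.rdfconnection.RDFConnection;
import org.apache.jena.update.UpdateFactory;
import org.apache.jena.update.UpdateRequest;

public class BatchedUpdates {
    public static void main(String[] args) {
        // Many small operations, separated by ";", sent as one update request
        // instead of thousands of tiny requests, each with its own transaction overhead.
        UpdateRequest batch = UpdateFactory.create();
        for (int i = 0; i < 1000; i++) {
            batch.add("INSERT DATA { <urn:ex:item" + i + "> <urn:ex:value> " + i + " }");
        }

        try (RDFConnection conn = RDFConnection.connect("http://localhost:3030/dataset")) {
            conn.update(batch);   // one HTTP request to the update endpoint
        }
    }
}
```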



Maybe some configuration that may be tweaked or so?


Sorry - there aren't any controls.



Best & thanks in advance,
Balduin



Andy


Re: Java 21 support for Jena Fuseki 5.0.0

2024-04-24 Thread Andy Seaborne




On 24/04/2024 10:41, Rob @ DNR wrote:

Java versions are generally forwards compatible, so Fuseki should run fine on 
Java 21, unless any of our dependencies have some previously unreported issues 
with Java 21

If you do find any bugs then please file bugs as appropriate

Thanks,

Rob


The project has CI with Java 21 (targeting Java 17 byte code) and the latest Java.

https://ci-builds.apache.org/job/Jena/

Currently, the Java23-latest build breaks because of:

1. Removal of javascript name "js"
2. org.awaitility not liking "23-ea" as the JDK version number
3. (jena-permissions) mockito -> Byte Buddy - "Java 23 not supported".

(2) and (3) will "just happen".

Andy


From: Balduin Landolt 
Date: Wednesday, 24 April 2024 at 09:46
To: users@jena.apache.org 
Subject: Java 21 support for Jena Fuseki 5.0.0
Hi list,

me again... Does Jena Fuseki 5.0.0 support Java 21?
On https://jena.apache.org/download/ all I can see is "Jena5 requires Java
17".

Best,
Balduin



Re: rdf:parseType="literal" vs. rdf:datatype="...XMLLiteral"

2024-04-22 Thread Andy Seaborne



On 22/04/2024 08:02, Thomas Francart wrote:

Hello

This is 3.17.0. Pretty old, due to other dependency with TopQuadrant SHACL
API.


It's not perfect in 5.0.0 either.

TopQuadrant SHACL is now 4.10.0. It would be good to upgrade because of 
XML security issue fixes around Jena 4.3.2.



It is being rejected because it is not legal RDF 1.0. At 1.0, the 
lexical space had restrictions (XML exclusive canonicalization), where a 
self-closing empty-element tag is not allowed. It has to be written as an 
explicit start/end tag pair -- there are various other rules as well.


(The DOM wasn't standardized at the time of RDF 1.0 IIRC)

The definition of rdf:XMLLiteral changed at RDF 1.1 to one where any XML 
document fragment string is valid.


Seems not all places got updated. Partially, that is because it was 
depending on the specific implementation of the Jena RDF/XML parser.


https://github.com/apache/jena/issues/2430


Do you happen to have the SPARQL queries? That part of your report is 
related to the value space of RDF XML Literals.


A proper update to RDF 1.1 may change the value object class (it is 
"string" for RDF 1.0; it is, by the spec, DocumentFragment for RDF 1.1; 
it could be kept as the document fragment's toString() in Jena. I'd like to 
understand the usage to see which change is best).


Andy

BTW It's rdf:parseType="Literal" -- Jena 5.0.0 is not tolerant of lower 
case "literal"




Thomas

Le sam. 20 avr. 2024 à 18:06, Andy Seaborne  a écrit :


Hi Thomas,

Which version of Jena is this?

  Andy

On 19/04/2024 17:18, Thomas Francart wrote:

Hello

The RDF/XML parsing of the following succeeds:




href="

https://xx.xx.xx/PC"/>



while the RDF/XML parsing of this gives an error : in that case the XML

has

simply be encoded with , and  and the rdf:datatype has been
explicitly set to XMLLiteral :



http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral

">amreferencesymbol

href="https://xx.xx.xx/PC
"//reference/am




The error is

13:08:04.742 WARN  org.apache.jena.riot - Lexical form
'https://xx.xx.xx/PC"/>'

not

valid for datatype XSD XMLLiteral

and then further down in SPARQL queries:

13:08:04.775 WARN  o.apache.jena.sparql.expr.NodeValue - Datatype format
exception: "https://xx.xx.xx/PC\
"/>"^^rdf:XMLLiteral

The encoded XML is however valid.

Is it possible to explicitely create literals with XMLLiteral datatype in
RDF/XML by setting this datatype explicitely ?

Thanks
Thomas









Re: rdf:parseType="literal" vs. rdf:datatype="...XMLLiteral"

2024-04-20 Thread Andy Seaborne

Hi Thomas,

Which version of Jena is this?

Andy

On 19/04/2024 17:18, Thomas Francart wrote:

Hello

The RDF/XML parsing of the following succeeds:



https://xx.xx.xx/PC"/>



while the RDF/XML parsing of this gives an error : in that case the XML has
simply be encoded with , and  and the rdf:datatype has been
explicitly set to XMLLiteral :



http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral;>amreferencesymbol
href="https://xx.xx.xx/PC
"//reference/am




The error is

13:08:04.742 WARN  org.apache.jena.riot - Lexical form
'https://xx.xx.xx/PC"/>' not
valid for datatype XSD XMLLiteral

and then further down in SPARQL queries:

13:08:04.775 WARN  o.apache.jena.sparql.expr.NodeValue - Datatype format
exception: "https://xx.xx.xx/PC\
"/>"^^rdf:XMLLiteral

The encoded XML is however valid.

Is it possible to explicitely create literals with XMLLiteral datatype in
RDF/XML by setting this datatype explicitely ?

Thanks
Thomas




Re: ModelExtract

2024-04-11 Thread Andy Seaborne

Hi Arne, hi Simon,

It got removed because there wasn't evidence of use, and 5.x.x was a chance 
to clear such things out.

It is opinionated, and it doesn't feel like a good fit in the central 
graph code. It's more like a utility library feature.


It can come back, maybe in a better form or better location.

So to both of you - what are your use cases? Which 
TripleBoundary/StatementBoundary implementations are in use?


Andy
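
For reference, the common case is not hard to reproduce outside the removed 
API. A rough sketch of a do-it-yourself extract (follow all resource objects 
from a root, copying statements into a new model) - not the removed code, 
just an illustration of the shape of a replacement:

```
import java.util.HashSet;
import java.util.Set;

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.RDFNode;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.rdf.model.StmtIterator;

public class SimpleExtract {

    /** Copy every statement reachable from root by following resource objects. */
    public static Model extract(Resource root, Model source) {
        Model result = ModelFactory.createDefaultModel();
        walk(root, source, result, new HashSet<>());
        return result;
    }

    private static void walk(Resource subject, Model source, Model result, Set<Resource> visited) {
        if (!visited.add(subject)) {
            return;                       // already expanded this subject
        }
        StmtIterator it = source.listStatements(subject, null, (RDFNode) null);
        while (it.hasNext()) {
            Statement stmt = it.next();
            result.add(stmt);
            RDFNode object = stmt.getObject();
            if (object.isResource()) {    // the "boundary" rule: follow every resource object
                walk(object.asResource(), source, result, visited);
            }
        }
    }
}
```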

On 09/04/2024 19:56, Arne Bernhardt wrote:

Hi Simon,

my colleagues had the same problem with GraphExtract.
The code has been removed in the context of Jena5 Model API changes
 in the commit
https://github.com/afs/jena/commit/6697a516724745532616bb0db3ce67a8778e2b6c.
So anyone may fetch the latest code from there.
Unfortunately, I am not sure why exactly it has been removed. Since it does
not implement any standard and is not documented under
https://jena.apache.org/documentation/,  I doubt it will find its way back
into Jena.

Greetings
Arne

Am Di., 9. Apr. 2024 um 19:54 Uhr schrieb Dutkowski, Simon <
simon.dutkow...@fokus.fraunhofer.de>:


Hi All

I realized that in version 5.0.0, the classes ModelExtract and co are
removed. Are there any replacements or other ways to achieve the same (or
similar) results?

I can easily fetch the classes from earlier versions and integrate them
into my project directly, but I am not sure if it is necessary, and if
possible I would prefer to avoid it.

Thanks in advance
 Simon

--
Dipl.-Inf. Simon Dutkowski
Fraunhofer FOKUS (DPS)
Kaiserin-Augusta-Allee 31, 10589 Berlin
+49 160 90112644






Re: Performance question with joins

2024-04-01 Thread Andy Seaborne

Hi John,

Yes, the join of two large subqueries is the issue.

Optimization involves making pragmatic determinations. Sometimes the result 
isn't optimal for some data.


Something to consider is detecting the independence of the (?X_i, ?X_j) 
and (?Y_i, ?Y_j) blocks, because a hash join is likely a better choice 
there. That, or caching partial evaluations where there are 
cross-product-like effects.
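
For anyone following along, the algebra John shows below (obtained with 
--explain) can also be produced programmatically. A minimal sketch with 
stock ARQ:

```
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.sparql.algebra.Algebra;
import org.apache.jena.sparql.algebra.Op;

public class ShowAlgebra {
    public static void main(String[] args) {
        Query query = QueryFactory.create(
            "SELECT (count(*) AS ?C) WHERE {"
            + " { SELECT ?X { VALUES ?X { 0 1 } } }"
            + " { SELECT ?X { VALUES ?X { 0 1 } } } }");

        Op op = Algebra.compile(query);      // raw algebra
        Op optimized = Algebra.optimize(op); // after ARQ's transforms (e.g. join -> sequence)

        System.out.println(op);
        System.out.println(optimized);
    }
}
```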


Thank you for the details.

Andy

See also:

"A Worst-Case Optimal Join Algorithm for SPARQL"
https://aidanhogan.com/docs/SPARQL_worst_case_optimal.pdf

"Leapfrog Triejoin: A Simple, Worst-Case Optimal Join"
Algorithm
https://www.openproceedings.org/2014/conf/icdt/Veldhuizen14.pdf

On 29/03/2024 09:25, John Walker wrote:

I did some more experimentation and checked the query algebra using the 
--explain option.

For sake of simplicity I use a simpler query:

```
select (count(*) as ?C)
where {
   {
 select ?X ?Y (struuid() as ?UUID)
 where {
   values ?X { 0 1 }
   values ?Y { 0 1 }
 }
   }
   {
 select ?X ?Y
 where {
   {
 select ?X ?Y (rand() as ?RAND)
 where {
   values ?X { 0 1 }
   values ?Y { 0 1 }
 }
   }
   filter (?RAND < 0.95)
 }
   }
}
```

For this the algebra is:

```
   (project (?C)
 (extend ((?C ?.0))
   (group () ((?.0 (count)))
 (sequence
   (project (?X ?Y ?UUID)
 (extend ((?UUID (struuid)))
   (sequence
 (table (vars ?Y)
   (row [?Y 0])
   (row [?Y 1])
 )
 (table (vars ?X)
   (row [?X 0])
   (row [?X 1])
 
   (project (?X ?Y)
 (project (?X ?Y ?/RAND)
   (filter (< ?/RAND 0.95)
 (extend ((?/RAND (rand)))
   (sequence
 (table (vars ?Y)
   (row [?Y 0])
   (row [?Y 1])
 )
 (table (vars ?X)
   (row [?X 0])
   (row [?X 1])
 ))
```

Whilst if I make a small change to also project some other variable from the 
second subquery

```
select (count(*) as ?C)
where {
   {
 select ?X ?Y (struuid() as ?UUID)
 where {
   values ?X { 0 1 }
   values ?Y { 0 1 }
 }
   }
   {
 select ?X ?Y (0 as ?_)
 where {
   {
 select ?X ?Y (rand() as ?RAND)
 where {
   values ?X { 0 1 }
   values ?Y { 0 1 }
 }
   }
   filter (?RAND < 0.95)
 }
   }
}
```

Then the algebra is:

```
   (project (?C)
 (extend ((?C ?.0))
   (group () ((?.0 (count)))
 (join
   (project (?X ?Y ?UUID)
 (extend ((?UUID (struuid)))
   (sequence
 (table (vars ?Y)
   (row [?Y 0])
   (row [?Y 1])
 )
 (table (vars ?X)
   (row [?X 0])
   (row [?X 1])
 
   (project (?X ?Y ?_)
 (extend ((?_ 0))
   (project (?X ?Y ?/RAND)
 (filter (< ?/RAND 0.95)
   (extend ((?/RAND (rand)))
 (sequence
   (table (vars ?Y)
 (row [?Y 0])
 (row [?Y 1])
   )
   (table (vars ?X)
 (row [?X 0])
 (row [?X 1])
   )))
```

Note the outermost sequence operator has changed to a join operator.
I don’t understand the logic behind that.

Note that projecting the ?RAND variable from the second query does not force 
the join.

John


-Original Message-
From: John Walker 
Sent: Friday, 29 March 2024 08:55
To: users@jena.apache.org
Subject: RE: Performance question with joins

I did a bit more experimentation by putting the second subquery inside some
other clauses:

* FILTER EXISTS - no effect
* OPTIONAL - runtime around 0.5s
* MINUS - runtime around 0.5s

So, I assume that the engine is doing some form of nested loop join to iterate
through each solution from the first subquery and test against the second.
Same as what is happening with FILTER EXISTS.

A "hack" to get around this seems to be to add a redundant MINUS {}
between the subqueries.

John


-Original Message-
From: John Walker 
Sent: Friday, 29 March 2024 07:58
To: jena-users-ml 
Subject: Performance question with joins

Hi,

I am working with some data representing a 2D Cartesian coordinate
system representing simple grid array “maps”
The X and Y coordinates are represented as integers.

I want to join data from different “layers” in the data.
One layer contains a unique identifier for each position.
The other layer only contains a subset of coordinates.

I have written the following queries to simulate some data to

Re: Requesting advice on Fuseki memory settings

2024-03-25 Thread Andy Seaborne




On 25/03/2024 07:05, Gaspar Bartalus wrote:

Dear Andy and co.,

Thanks for the support, I think we can close this thread for now.
We will continue to monitor this behaviour and if we can retrieve any
additional useful information then we might reopen it.


Please do pass on any information and techniques for operating 
Fuseki/TDB. There is so much variety "out there" that all reports are 
helpful.


Andy



Best regards,
Gaspar

On Sun, Mar 24, 2024 at 5:00 PM Andy Seaborne  wrote:




On 21/03/2024 09:52, Rob @ DNR wrote:

Gaspar

This probably relates to https://access.redhat.com/solutions/2316

Deleting a file removes it from the file table but doesn’t immediately

free the space if a process is still accessing those files.  That could be
something else inside the container, or in a containerised environment
where the disk space is mounted that could potentially be host processes on
the K8S node that are monitoring the storage.
  >

There’s some suggested debugging steps in the RedHat article about ways

to figure out what processes might still be holding onto the old database
files


Rob


Fuseki does close the database connections after compact, but only after
all read transactions on the old database have completed. That can hold
the old database open for a while.

Another delay is the ext4 filesystem. Deletes will be in the journal,
and only when the journal operations are performed will the space be
released. Usually this happens quickly, but I've seen it take an
appreciable length of time occasionally.

Gaspar wrote:
  > then we start fresh where du -sh and df -h return the same numbers.

This indicates the file space has been released. Restarting clears any
outstanding read transactions and likely gives the ext4 journal time to run
through.

Just about any layer (K8s, VMs) adds delays to real release of the space
but it should happen eventually.

  Andy


From: Gaspar Bartalus 
Date: Wednesday, 20 March 2024 at 11:41
To: users@jena.apache.org 
Subject: Re: Requesting advice on Fuseki memory settings
Hi Andy

On Sat, Mar 16, 2024 at 8:58 PM Andy Seaborne  wrote:




On 12/03/2024 13:17, Gaspar Bartalus wrote:

On Mon, Mar 11, 2024 at 6:28 PM Andy Seaborne  wrote:


On 11/03/2024 14:35, Gaspar Bartalus wrote:

Hi Andy,

On Fri, Mar 8, 2024 at 4:41 PM Andy Seaborne

wrote:




On 08/03/2024 10:40, Gaspar Bartalus wrote:

Hi,

Thanks for the responses.

We were actually curious if you'd have some explanation for the
linear increase in the storage, and why we are seeing differences

between

the actual size of our dataset and the size it uses on disk.

(Changes

between `df -h` and `du -lh`)?

Linear increase between compactions or across compactions? The

latter

sounds like the previous version hasn't been deleted.


Across compactions, increasing linearly over several days, with

compactions

running every day. The compaction is used with the "deleteOld"

parameter,

and there is only one Data- folder in the volume, so I assume

compaction

itself works as expected.



Strange - I can't explain that. Could you check that there is only one
Data- directory inside the database directory?


Yes, there is surely just one Data- folder in the database

directory.



What's the disk storage setup? e.g filesystem type.


We have an Azure disk of type Standard SSD LRS with a filesystem of

type

Ext4.


Hi Gaspar,

I still can't explain what you're seeing, I'm afraid.

Can we get some more details?

When the server has Data-N -- how big (as reported by 'du -sh') is that
directory and how big is the whole directory for the database. They
should be nearly equal.




When a compaction is done, and the server is at Data-(N+1), what are the
sizes of Data-(N+1) and the database directory?



What we see with respect to compaction is usually the following:
- We start with the Data-N folder of ~210MB
- After compaction we have a Data-(N+1) folder of size ~185MB, the old
Data-N being deleted.
- The sizes of the database directory and the Data-* directory are equal.

However when we check with df -h we sometimes see that volume usage is

not

dropping, but on the contrary, it goes up ~140MB after each compaction.



Does stop/starting the server change those numbers?



Yes, then we start fresh where du -sh and df -h return the same numbers.



   Andy







Re: query performance on named graph vs. default graph

2024-03-24 Thread Andy Seaborne




On 21/03/2024 00:21, Jim Balhoff wrote:

Hi Lorenz,

These both do speed things up quite a bit, but it prevents matching patterns 
that cross graphs in the case where I include multiple graphs.

Thanks,
Jim


It is the combination of choosing certain graphs and wanting cross-graph 
patterns that pushes the code into working in a general way. It works in 
Nodes, and that means string comparisons. That loses the TDB ability 
to do faster joins using NodeIds, which avoids string comparisons 
and avoids retrieving the strings until they are known to be needed for the 
results.


Is there a reason for not having a union default graph over all the named 
graphs instead of selecting certain ones? If it is all named graphs, the 
union is done at the TDB2 level.


You can have a Fuseki setup with two endpoints - one that does union 
default graph, one that does not, for the same dataset.


Andy
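
On Jim's question (quoted below) about building a union model over the 
per-source graphs: a rough sketch of that approach, where the data files 
are placeholders (for HDT, the graphs would come from the HDT bindings). 
Note the caveat above still applies - a dynamically composed union is 
matched by the general-purpose engine, much like FROM, so it does not get 
a storage-level fast path:

```
import org.apache.jena.graph.Graph;
import org.apache.jena.graph.compose.MultiUnion;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.RDFDataMgr;

public class UnionOfGraphs {
    public static void main(String[] args) {
        // Stand-ins for the per-source graphs.
        Graph g1 = RDFDataMgr.loadGraph("source1.ttl");
        Graph g2 = RDFDataMgr.loadGraph("source2.ttl");

        // Compose a read-only union view and wrap it as a Model.
        Model union = ModelFactory.createModelForGraph(new MultiUnion(new Graph[] { g1, g2 }));

        try (QueryExecution qExec =
                 QueryExecutionFactory.create("SELECT (COUNT(*) AS ?n) { ?s ?p ?o }", union)) {
            qExec.execSelect().forEachRemaining(System.out::println);
        }
    }
}
```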





On Mar 20, 2024, at 4:28 AM, Lorenz Buehmann 
 wrote:

Hi,

what about

SELECT *
FROM NAMED 
FROM NAMED 
FROM NAMED  ...
FROM NAMED 
{
   GRAPH ?g {
   ...
   }
}

or

SELECT *
{
  VALUES ?g {  ... }
   GRAPH ?g {
 ...
   }
}


does that work better?

On 19.03.24 15:21, Jim Balhoff wrote:

Hi Andy,


On Mar 19, 2024, at 5:02 AM, Andy Seaborne  wrote:
Hi Jim,

What happens if you use GRAPH rather than FROM?

WHERE {
   GRAPH <http://example.org/ubergraph> {
 ?cell rdfs:subClassOf cell: .
 ?cell part_of: ?organ .
 ?organ rdfs:subClassOf organ: .
 ?organ part_of: abdomen: .
 ?cell rdfs:label ?cell_label .
 ?organ rdfs:label ?organ_label .
   }
}


This does help. With TDB this is actually faster than using the default graph. 
With the HDT setup it’s about the same (fast). But it doesn’t work that well 
for what I’m trying to do (below).


FROM builds a "view dataset" which is general purpose (e.g. multiple FROM are 
possible) but which is less efficient for basic graph pattern matching. It does not use 
the TDB2 basic graph pattern matcher.

GRAPH restricts to a single graph and the query goes direct to TDB2 basic graph 
pattern matcher.



If there is only one named graph, is there a reason to have it as a named graph? 
Using the default graph and no unionDefaultGraph may be

What I am really trying to do is have suite of large graphs that I can choose 
to include or not in a particular query, depending on what data sources I want 
to use in the query. I have several HDT files, one for each data source. I set 
this up as a dataset with a named graph for each data file, and was at first 
very happy with how it performed while turning on and off graphs using FROM 
lines. For example I have Wikidata in one HDT file, and it looks like having it 
available doesn’t slow down queries on other graphs when it’s not included. 
However I did see that performance issue in the query I asked about, and found 
it wasn’t related to having multiple graphs loaded; it happens even with just 
that one graph configured.

If I wrote my own server that accepted a list of data source names in a query 
parameter, and then for each request constructed a union model for executing 
the query over the required HDT graphs, would that work any better? Or is that 
basically the same as what FROM is doing?

Thank you,
Jim



--
Lorenz Bühmann
Research Associate/Scientific Developer

Email buehm...@infai.org

Institute for Applied Informatics e.V. (InfAI) | Goerdelerring 9 | 04109 
Leipzig | Germany





Re: Requesting advice on Fuseki memory settings

2024-03-24 Thread Andy Seaborne




On 21/03/2024 09:52, Rob @ DNR wrote:

Gaspar

This probably relates to https://access.redhat.com/solutions/2316

Deleting a file removes it from the file table but doesn’t immediately free the 
space if a process is still accessing those files.  That could be something 
else inside the container, or in a containerised environment where the disk 
space is mounted that could potentially be host processes on the K8S node that 
are monitoring the storage.

>

There’s some suggested debugging steps in the RedHat article about ways to 
figure out what processes might still be holding onto the old database files

Rob


Fuseki does close the database connections after compact, but only after 
all read transactions on the old database have completed. That can hold 
the old database open for a while.


Another delay is the ext4 filesystem. Deletes will be in the journal, 
and only when the journal operations are performed will the space be 
released. Usually this happens quickly, but I've seen it take an 
appreciable length of time occasionally.


Gaspar wrote:
> then we start fresh where du -sh and df -h return the same numbers.

This indicates the file space has been released. Restarting clears any 
outstanding read transactions and likely gives the ext4 journal time to run 
through.


Just about any layer (K8s, VMs) adds delays to real release of the space 
but it should happen eventually.


Andy


From: Gaspar Bartalus 
Date: Wednesday, 20 March 2024 at 11:41
To: users@jena.apache.org 
Subject: Re: Requesting advice on Fuseki memory settings
Hi Andy

On Sat, Mar 16, 2024 at 8:58 PM Andy Seaborne  wrote:




On 12/03/2024 13:17, Gaspar Bartalus wrote:

On Mon, Mar 11, 2024 at 6:28 PM Andy Seaborne  wrote:


On 11/03/2024 14:35, Gaspar Bartalus wrote:

Hi Andy,

On Fri, Mar 8, 2024 at 4:41 PM Andy Seaborne  wrote:



On 08/03/2024 10:40, Gaspar Bartalus wrote:

Hi,

Thanks for the responses.

We were actually curious if you'd have some explanation for the
linear increase in the storage, and why we are seeing differences

between

the actual size of our dataset and the size it uses on disk. (Changes
between `df -h` and `du -lh`)?

Linear increase between compactions or across compactions? The latter
sounds like the previous version hasn't been deleted.


Across compactions, increasing linearly over several days, with

compactions

running every day. The compaction is used with the "deleteOld"

parameter,

and there is only one Data- folder in the volume, so I assume

compaction

itself works as expected.



Strange - I can't explain that. Could you check that there is only one
Data- directory inside the database directory?


Yes, there is surely just one Data- folder in the database directory.


What's the disk storage setup? e.g filesystem type.


We have an Azure disk of type Standard SSD LRS with a filesystem of type
Ext4.


Hi Gaspar,

I still can't explain what you're seeing, I'm afraid.

Can we get some more details?

When the server has Data-N -- how big (as reported by 'du -sh') is that
directory and how big is the whole directory for the database. They
should be nearly equal.




When a compaction is done, and the server is at Data-(N+1), what are the
sizes of Data-(N+1) and the database directory?



What we see with respect to compaction is usually the following:
- We start with the Data-N folder of ~210MB
- After compaction we have a Data-(N+1) folder of size ~185MB, the old
Data-N being deleted.
- The sizes of the database directory and the Data-* directory are equal.

However when we check with df -h we sometimes see that volume usage is not
dropping, but on the contrary, it goes up ~140MB after each compaction.



Does stop/starting the server change those numbers?



Yes, then we start fresh where du -sh and df -h return the same numbers.



  Andy



Re: [ANN] Apache Jena 5.0.0

2024-03-21 Thread Andy Seaborne




On 20/03/2024 17:18, Arne Bernhardt wrote:

Hi Ryan,

there is no "term graph" to be found via Google. From Jena 5.0 on, the
default in-memory Graph in Jena will treat typed literals everywhere as
described under "literals term equality" in
https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal.

Before Jena 5, the default in-memory graph "indexed" object nodes based on
their values for typed literals, and methods like Graph#find and
Graph#contains found matches based on the values.

As far as I know, Fuseki always evaluated SPARQL with the
standard-compliant literal term equality.
But if one executed a query via the query API on the Jena 4 in-memory
graphs, the query execution would use object value equality.

I hope my explanation was roughly correct and helpful.

Arne


Hi Ryan,

In RDF, a literal looks like

"1"^^xsd:int

It is one of the kinds of RDF term

https://www.w3.org/TR/rdf11-concepts/#section-rdf-graph

"1" is lexical form.
xsd:int is the datatype.

The datatype xsd:int determines how these are mapped to values.
"+1", "0001" and "1" all map to the value one.

Two literal terms are the same term if and only if they have the same 
lexical form and same datatype (and language tag).


"+1"^^xsd:int has a different lexical form to "1"^^xsd:int so it is a 
different RDF term, yet they represent the same value.


In SPARQL,
   SAMETERM("1"^^xsd:int, "+1"^^xsd:int) is false.
   "1"^^xsd:int = "+1"^^xsd:int  is true.

Some Jena models stored literals by value.
RDF and SPARQL are defined to work with a graph made out of RDF terms, 
not values.


A "term graph" is one where Graph.find(,,1) or Model.listStatements() 
only considers RDF terms.


A "value graph" is one where looking for the literal "1"^^xsd:int might 
find "+1"^^xsd:int.



The change shouldn't have a widespread impact but it could be visible.
XSD datatypes define a canonical form - the preferred way to write a 
value. "1"^^xsd:int is canonical; "+1"^^xsd:int is not canonical.

Most published data uses canonical forms.

Andy


Shaw, Ryan  schrieb am Mi., 20. März 2024, 13:32:




On Mar 20, 2024, at 5:05 AM, Andy Seaborne  wrote:

** Term graphs

Graphs are now term graphs in the API or SPARQL. That is, they do not

match "same value" for some of the Java mapped datatypes. The model API
already normalizes values written.


TDB1, TDB2 keep their value canonicalization during data loading.

A legacy value-graph implementation can be obtained from GraphMemFactory.


Can someone point me to an explanation of what this means? I am not
familiar with the terminology of "term graph" and "value graph" and a quick
web search turns up nothing that looks relevant.







[ANN] Apache Jena 5.0.0

2024-03-20 Thread Andy Seaborne

The Apache Jena development community is pleased to
announce the release of Apache Jena 5.0.0.

In Jena5:

* Minimum Java requirement: Java 17

* Language tags are case-insensitive unique.

* Term graphs for in-memory models

* RRX - New RDF/XML parser

* Remove support for JSON-LD 1.0

* Turtle/Trig Output : default output PREFIX and BASE

* New artifacts : jena-bom and OWASP CycloneDX SBOM

* API deprecation removal

* Dependency updates :
Note: slf4j update : v1 to v2 (needs log4j change)

More details below.

 Contributions:

Configurable CORS headers for Fuseki
  From Paul Gallagher

Balduin Landolt @BalduinLandolt - javadoc fix for Literal.getString.

@OyvindLGjesdal - https://github.com/apache/jena/pull/2121 -- text index fix

Tong Wang @wang3820 Fix tests due to hashmap order

Explicit Accept headers on RDFConnectionRemote fix
  from @Aklakan



All issues in this release:
https://s.apache.org/jena-5.0.0-issues

which includes the ones specifically related to Jena5:

  https://github.com/apache/jena/issues?q=label%3Ajena5

** Java Requirement

Java 17 or later is required.
Java 17 language constructs now are used in the codebase.

Jakarta JavaEE required for deploying the WAR file (Apache Tomcat10)

** Language tags

Language tags are now case-insensitively unique.

"abc"@EN and "abc"@en are the same RDF term.

Internally, language tags are formatted using the algorithm of RFC 5646.

Examples "@en", "@en-GB", "@en-Latn-GB".

SPARQL LANG(?literal) will return a formatted language tag.

Data stored in TDB using language tags must be reloaded.

** Term graphs

Graphs are now term graphs in the API or SPARQL. That is, they do not 
match "same value" for some of the Java mapped datatypes. The model API 
already normalizes values written.


TDB1, TDB2 keep their value canonicalization during data loading.

A legacy value-graph implementation can be obtained from GraphMemFactory.

** RRX - New RDF/XML parser

RRX is the default RDF/XML parser. It is a replacement for ARP.
RIOT uses RRX.

The ARP parser is still temporarily available for transition assistance.

** Remove support for JSON-LD 1.0

JSON-LD 1.1, using Titanium-JSON-LD, is the supported version of JSON-LD.

https://github.com/filip26/titanium-json-ld

** Turtle/Trig Output

"PREFIX" and "BASE" are output by default for Turtle and TriG output.

** Artifacts

There is now a release BOM for Jena artifacts - artifact 
org.apache.jena:jena-bom


There are now OWASP CycloneDX SBOM for Jena artifacts.
https://github.com/CycloneDX

jena-tdb is renamed jena-tdb1.

jena-jdbc is no longer released

** Dependencies

The update to slf4j 2.x means the log4j artifact changes to
"log4j-slf4j2-impl" (was "log4j-slf4j-impl").


 API Users

** Deprecation removal

There has been a clearing out of deprecated functions, methods and 
classes. This includes the deprecations in Jena 4.10.0 added to show 
code that is being removed in Jena5.


** QueryExecutionFactory

QueryExecutionFactory is simplified to cover commons cases only; it 
becomes a way to call the general QueryExecution builders which are 
preferred and provide all full query execution setup controls.


Local execution builder:
QueryExecution.create()...

Remote execution builder:
QueryExecution.service(URL)...

** QueryExecution variable substitution

Using "substitution", where the query is modified by replacing one or 
more variables by RDF terms, is now preferred to using "initial 
bindings", where query solutions include (var,value) pairs.


"substitution" is available for all queries, local and remote, not just 
local executions.
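
A minimal sketch of the builder style with substitution (the endpoint URL 
and resource IRI are placeholders):

```
import org.apache.jena.query.Dataset;
import org.apache.jena.query.DatasetFactory;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.rdf.model.ResourceFactory;

public class BuilderStyle {
    public static void main(String[] args) {
        Dataset dataset = DatasetFactory.createTxnMem();

        // Local execution builder with variable substitution.
        try (QueryExecution qExec = QueryExecution.dataset(dataset)
                .query("SELECT ?p ?o WHERE { ?s ?p ?o }")
                .substitution("s", ResourceFactory.createResource("http://example.org/thing"))
                .build()) {
            qExec.execSelect().forEachRemaining(System.out::println);
        }

        // Remote execution builder.
        try (QueryExecution remote = QueryExecution.service("http://localhost:3030/ds/sparql")
                .query("ASK { ?s ?p ?o }")
                .build()) {
            System.out.println(remote.execAsk());
        }
    }
}
```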


Rename TDB1 packages org.apache.jena.tdb -> org.apache.jena.tdb1

 Fuseki Users

Fuseki: Uses the Jakarta namespace for servlets and Fuseki has been 
upgraded to use Eclipse Jetty12.


Apache Tomcat10 or later, is required for running the WAR file.
Tomcat 9 or earlier will not work.


== Obtaining Apache Jena 5.0.0

* Via central.maven.org

The main jars and their dependencies can used with:

  
org.apache.jena
apache-jena-libs
pom
5.0.0
  

Full details of all maven artifacts are described at:

http://jena.apache.org/download/maven.html

* As binary downloads

Apache Jena libraries are available as a binary distribution of
libraries. For details of a global mirror copy of Jena binaries please see:

http://jena.apache.org/download/

* Source code for the release

The signed source code of this release is available at:

http://www.apache.org/dist/jena/source/

and the signed master source for all Apache Jena releases is available
at: http://archive.apache.org/dist/jena/

== Contributing

If you would like to help out, a good place to look is the list of
unresolved issues at:

https://github.com/apache/jena/issues

or review pull requests at

https://github.com/apache/jena/pulls

or drop into the dev@ list.

We use github pull requests and other ways for accepting code:
  

Re: [EXTERNAL] Re: Query Performance Degrade With Sorting In Subquery

2024-03-19 Thread Andy Seaborne

Hi there,

Could you give some background as to what the sub-select / ORDER / LIMIT 
blocks are trying to achieve? Maybe there is another way.


Andy

On 19/03/2024 10:50, Rob @ DNR wrote:

You haven’t specified how your data is stored but assuming you are using Jena’s 
TDB/TDB2 then the triples/quads themselves are already indexed for efficient 
access.  It also inlines some value types that speeds up some comparisons and 
filters, including those used in simple ORDER BY expression as in your example.

This assumes that your objects for relations:hasUserCount triples are properly 
typed as xsd:integer or another well-known XSD numeric type, if not Jena is 
forced to fallback to more simplistic lexical string sorting which can be more 
expensive.

However, there is no indexing available for sorting because SPARQL allows for 
arbitrarily complex sort expressions, and the inputs to those expressions may 
themselves be dynamically computed values that don’t exist in the underlying 
dataset directly.

Rob

From: Chirag Ratra 
Date: Tuesday, 19 March 2024 at 10:39
To: users@jena.apache.org , Andy Seaborne , 
dcchabg...@gmail.com 
Subject: Re: [EXTERNAL] Re: Query Performance Degrade With Sorting In Subquery
Is there any way to create an index or something?

On Tue, Mar 19, 2024 at 3:46 PM Rob @ DNR  wrote:


This is due to Jena’s lazy evaluation in its query engine.

When you include a LIMIT clause on its own Jena only needs find the first
N results (10 in your example) at which point it can abort any further
processing and return results.  In this case evaluation is lazy.

When you include LIMIT and ORDER BY clauses Jena has to find all possible
results, sort them, and then return only the first N results.  In this case
full evaluation is required.

One possible approach might be to split this into multiple queries, i.e. do one
query to get your main set of results, and then separately issue the
related-item sub-queries with concrete values substituted in for your
?concept and ?titleSkosXl values. While Jena will still need to do full
evaluation, injecting a concrete value will constrain the query evaluation
further.

Hope this helps,

Rob
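
A minimal sketch of that split-and-substitute idea using 
ParameterizedSparqlString (the concrete IRI is a placeholder; the 
properties are the ones from the query above):

```
import org.apache.jena.query.ParameterizedSparqlString;
import org.apache.jena.query.Query;

public class SubQuerySubstitution {
    public static void main(String[] args) {
        // Template for the "related titles" sub-query; ?titleSkosxl is filled in per result row.
        ParameterizedSparqlString pss = new ParameterizedSparqlString(
            "PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#> " +
            "PREFIX relations: <https://cxdata.bold.com/ontologies/myDomain#> " +
            "SELECT ?relatedTitle WHERE { " +
            "  ?titleSkosxl relations:isRelatedTo ?relatedSkosxl . " +
            "  ?relatedSkosxl skosxl:literalForm ?relatedTitle ; " +
            "                 relations:hasUserCount ?relatedUserCount . " +
            "} ORDER BY DESC(?relatedUserCount) LIMIT 10");

        // Concrete value obtained from the first query (placeholder IRI).
        pss.setIri("titleSkosxl", "https://cxdata.bold.com/titles/example");

        Query query = pss.asQuery();
        System.out.println(query);
        // Execute 'query' against the dataset/endpoint for each ?titleSkosxl of interest.
    }
}
```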

From: Chirag Ratra 
Date: Tuesday, 19 March 2024 at 07:46
To: users@jena.apache.org 
Subject: Query Performance Degrade With Sorting In Subquery
Hi,

Facing a big performance degradation  while using sort query in subquery
If I run query without sorting the response of my query is around 200 ms
but when I use the order by query,  performance comes to be around 4-5
seconds.

Here is my query :

PREFIX text: <http://jena.apache.org/text#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
PREFIX relations: <https://cxdata.bold.com/ontologies/myDomain#>

SELECT ?concept ?titleSkosxl ?title ?languageCode (GROUP_CONCAT(DISTINCT
?relatedTitle; separator=", ") AS ?relatedTitles) (GROUP_CONCAT(DISTINCT
?alternate; separator=", ") AS ?alternates)
WHERE
{
   (?titleSkosxl ?score) text:query ('cashier').

?concept skosxl:prefLabel ?titleSkosxl.
   ?titleSkosxl skosxl:literalForm ?title.
   ?titleSkosxl relations:usedInLocale ?controlledList.
   ?controlledList relations:languageMarketCode ?languageCode
FILTER(?languageCode = 'en-US').


#  get alternate title
OPTIONAL
   {
 Select ?alternate  {
 ?concept skosxl:altLabel ?alternateSkosxl.
 ?alternateSkosxl skosxl:literalForm ?alternate;
   relations:hasUserCount ?alternateUserCount.
 }
ORDER BY DESC (?alternateUserCount) LIMIT 10
}

#  get related titles
   OPTIONAL
   {
   Select ?relatedTitle
   {
 ?titleSkosxl relations:isRelatedTo ?relatedSkosxl.
 ?relatedSkosxl skosxl:literalForm ?relatedTitle;
 relations:hasUserCount ?relatedUserCount.
   }
ORDER BY DESC (?relatedUserCount) LIMIT 10
}
}
GROUP BY ?concept ?titleSkosxl ?title ?languageCode ?alternateJobTitle
?notation
ORDER BY DESC(?jobtitleWeight) DESC(?score)
LIMIT 10

The sorting queries given causes huge performance degradation :
ORDER BY DESC (?alternateUserCount) AND ORDER BY DESC (?relatedUserCount)

How can this be improved, this sorting will be used in each and every query
in my application.

--











--








Re: query performance on named graph vs. default graph

2024-03-19 Thread Andy Seaborne




On 18/03/2024 17:46, Jim Balhoff wrote:

Hi,

I’m running a particular query in a Fuseki server which performs very 
differently if the data is in a named graph vs. the default graph. I’m 
wondering if it’s expected to have a large performance hit if a named graph is 
specified. The dataset consists of ~462 million triples; it’s this dataset with 
all graphs merged together: 
https://github.com/INCATools/ubergraph?tab=readme-ov-file#downloads

I have loaded all the triples into a named graph in TDB2 using this command:

tdb2.tdbloader --loc tdb --graph 'http://example.org/ubergraph' ubergraph.nt.gz

My fuseki config is like this:

[] rdf:type fuseki:Server ;
 ja:context [ ja:cxtName "arq:queryTimeout" ;  ja:cxtValue "12" ] ;
 fuseki:services ( <#my-service> ) .

<#my-service> rdf:type fuseki:Service ;
 fuseki:name  "union" ;
 fuseki:serviceQuery  "sparql" ;
 fuseki:serviceReadGraphStore "get" ;
 fuseki:dataset   <#dataset> .

<#dataset> rdf:type  tdb2:DatasetTDB2 ;
 tdb2:location "tdb" ;
 tdb2:unionDefaultGraph true .

This is my query:

PREFIX rdfs: 
PREFIX cell: 
PREFIX organ: 
PREFIX abdomen: 
PREFIX part_of: 
SELECT DISTINCT ?cell ?organ
FROM 
WHERE {
   ?cell rdfs:subClassOf cell: .
   ?cell part_of: ?organ .
   ?organ rdfs:subClassOf organ: .
   ?organ part_of: abdomen: .
   ?cell rdfs:label ?cell_label .
   ?organ rdfs:label ?organ_label .
}

Using the FROM line causes the query to complete in about 40 seconds. Deleting 
the FROM line allows the query to complete in about 5 seconds.

The reason I was testing this in TDB2 is that I first noticed this behavior 
with an HDT backend, and wanted to make sure it wasn’t only an HDT issue. If I 
create a dataset using an HDT graph as the default graph, the query completes 
in a fraction of a second, but if I use the graph as a named graph the time 
jumps to about 20 seconds. For both of these scenarios (TDB2 and HDT) there is 
only a single named graph in the dataset.

Is there any way to improve performance when using FROM in the query?


Hi Jim,

What happens if you use GRAPH rather than FROM?

WHERE {
   GRAPH  {
 ?cell rdfs:subClassOf cell: .
 ?cell part_of: ?organ .
 ?organ rdfs:subClassOf organ: .
 ?organ part_of: abdomen: .
 ?cell rdfs:label ?cell_label .
 ?organ rdfs:label ?organ_label .
   }
}

FROM builds a "view dataset" which is general purpose (e.g. multiple 
FROM are possible) but which is less efficient for basic graph pattern 
matching. It does not use the TDB2 basic graph pattern matcher.


GRAPH restricts to a single graph and the query goes direct to TDB2 
basic graph pattern matcher.




If there is only one named graph, is there a reason to have it as a named 
graph? Using the default graph and no unionDefaultGraph may be


Andy



Thank you,
Jim



Re: Requesting advice on Fuseki memory settings

2024-03-16 Thread Andy Seaborne




On 12/03/2024 13:17, Gaspar Bartalus wrote:

On Mon, Mar 11, 2024 at 6:28 PM Andy Seaborne  wrote:


On 11/03/2024 14:35, Gaspar Bartalus wrote:

Hi Andy,

On Fri, Mar 8, 2024 at 4:41 PM Andy Seaborne  wrote:



On 08/03/2024 10:40, Gaspar Bartalus wrote:

Hi,

Thanks for the responses.

We were actually curious if you'd have some explanation for the
linear increase in the storage, and why we are seeing differences

between

the actual size of our dataset and the size it uses on disk. (Changes
between `df -h` and `du -lh`)?

Linear increase between compactions or across compactions? The latter
sounds like the previous version hasn't been deleted.


Across compactions, increasing linearly over several days, with

compactions

running every day. The compaction is used with the "deleteOld" parameter,
and there is only one Data- folder in the volume, so I assume compaction
itself works as expected.



Strange - I can't explain that. Could you check that there is only one
Data- directory inside the database directory?


Yes, there is surely just one Data- folder in the database directory.


What's the disk storage setup? e.g filesystem type.


We have an Azure disk of type Standard SSD LRS with a filesystem of type
Ext4.


Hi Gaspar,

I still can't explain what you're seeing, I'm afraid.

Can we get some more details?

When the server has Data-N -- how big (as reported by 'du -sh') is that 
directory and how big is the whole directory for the database. They 
should be nearly equal.


When a compaction is done, and the server is at Data-(N+1), what are the 
sizes of Data-(N+1) and the database directory?


Does stop/starting the server change those numbers?

Andy


Re: Problems when querying the SPARQL with Jena

2024-03-12 Thread Andy Seaborne




On 12/03/2024 13:02, Anna P wrote:

Hi Lorenz,

Thank you for your reply. Yes, I used Maven to build the project. Here are the dependency details:

<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
</properties>

<dependencies>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.11</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.jena</groupId>
    <artifactId>apache-jena-libs</artifactId>
    <version>5.0.0-rc1</version>
    <type>pom</type>
  </dependency>
</dependencies>

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>

Is that creating a jar file to run?

The assembly plugin does not manage Java's service loader files. They need 
merging in some way and putting into the assembled jar.


There is the shade plugin that manages combined service loader files 
more easily:


https://jena.apache.org/documentation/notes/jena-repack.html

This is how the combined jar jena-fuseki-server is built:

https://github.com/apache/jena/blob/main/jena-fuseki2/jena-fuseki-server/pom.xml#L87-L138

Andy


3.6.0
maven-plugin



Best regards,
Pan

On Tue, Mar 12, 2024 at 7:13 AM Lorenz Buehmann <
buehm...@informatik.uni-leipzig.de> wrote:


Hi,

how did you setup your project? Which Jena version? Do you use Maven?
Which dependencies? It looks like ARQ.init() hasn't been called which
should happen automatically if the setup of the project is correct.


Cheers,
Lorenz

On 11.03.24 14:44, Anna P wrote:

Dear Jena support team,

Currently I just started to work on a SPARQL project using Jena and I

could

not get a solution when I query a model.
I imported a turtle file and ran a simple query, and the snippet code is
shown below. However, I got the error.

public class App {
    public static void main(String[] args) {
        try {
            Model model = RDFDataMgr.loadModel("data.ttl", Lang.TURTLE);
            RDFDataMgr.write(System.out, model, Lang.TURTLE);
            String queryString = "SELECT * { ?s ?p ?o }";
            Query query = QueryFactory.create(queryString);
            QueryExecution qe = QueryExecutionFactory.create(query, model);
            ResultSet results = qe.execSelect();
            ResultSetFormatter.out(System.out, results, query);
            qe.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Here is the error message:

org.apache.jena.riot.RiotException: Not registered as a SPARQL result set
output syntax: Lang:SPARQL-Results-JSON
  at


org.apache.jena.sparql.resultset.ResultsWriter.write(ResultsWriter.java:179)

  at


org.apache.jena.sparql.resultset.ResultsWriter.write(ResultsWriter.java:156)

  at


org.apache.jena.sparql.resultset.ResultsWriter.write(ResultsWriter.java:149)

  at


org.apache.jena.sparql.resultset.ResultsWriter$Builder.write(ResultsWriter.java:96)

  at


org.apache.jena.query.ResultSetFormatter.output(ResultSetFormatter.java:308)

  at


org.apache.jena.query.ResultSetFormatter.outputAsJSON(ResultSetFormatter.java:516)

  at de.unistuttgart.ki.esparql.App.main(App.java:46)


Thank you for your time and help!

Best regards,

Pan


--
Lorenz Bühmann
Research Associate/Scientific Developer

Email buehm...@infai.org

Institute for Applied Informatics e.V. (InfAI) | Goerdelerring 9 | 04109
Leipzig | Germany






Re: Requesting advice on Fuseki memory settings

2024-03-11 Thread Andy Seaborne




On 11/03/2024 14:35, Gaspar Bartalus wrote:

Hi Andy,

On Fri, Mar 8, 2024 at 4:41 PM Andy Seaborne  wrote:




On 08/03/2024 10:40, Gaspar Bartalus wrote:

Hi,

Thanks for the responses.

We were actually curious if you'd have some explanation for the
linear increase in the storage, and why we are seeing differences between
the actual size of our dataset and the size it uses on disk. (Changes
between `df -h` and `du -lh`)?


Linear increase between compactions or across compactions? The latter
sounds like the previous version hasn't been deleted.



Across compactions, increasing linearly over several days, with compactions
running every day. The compaction is used with the "deleteOld" parameter,
and there is only one Data- folder in the volume, so I assume compaction
itself works as expected.


Strange - I can't explain that. Could you check that there is only one 
Data- directory inside the database directory?


What's the disk storage setup? e.g filesystem type.

Andy


TDB uses sparse files. It allocates 8M chunks per index but that isn't
used immediately. Sparse files are reported differently by different
tools and also differently by different operating systems. I don't know
how k3s is managing the storage.

Sometimes it's the size of the file, sometimes it's the amount of space
in use. For small databases, there is quite a difference.

An empty database is around 220kbytes but you'll see many 8Mbyte files
with "ls -l".

If you zip the database up, and unpack it then it's 193Mbytes.

After a compaction, the previous version of storage can be deleted. The
directory "Data-..." - only the highest numbered directory is used. A
previous one can be zipped up for backup.


The heap memory has some very minimal peaks, saw-tooth, but otherwise

it's

flat.


At what amount of memory?



At ~7GB.





Regards,
Gaspar

On Thu, Mar 7, 2024 at 11:55 PM Andy Seaborne  wrote:




On 07/03/2024 13:24, Gaspar Bartalus wrote:

Dear Jena support team,

We would like to ask you to help us in configuring the memory for our
jena-fuseki instance running in kubernetes.

*We have the following setup:*

* Jena-fuseki deployed as StatefulSet to a k8s cluster with the
resource config:

Limits:
cpu: 2
memory:  16Gi
Requests:
cpu: 100m
memory:  11Gi

* The JVM_ARGS has the following value: -Xmx10G

* Our main dataset of type TDB2 contains ~1 million triples.

A million triples doesn't take up much RAM even in a memory dataset.

In Java, the JVM will grow until it is close to the -Xmx figure. A major
GC will then free up a lot of memory. But the JVM does not give the
memory back to the kernel.

TDB2 does not only use heap space. A heap of 2-4G is usually enough per
dataset, sometimes less (data shape dependent - e.g. many large
literals use more space).

Use a profiler to examine the heap in-use, you'll probably see a
saw-tooth shape.
Force a GC and see the level of in-use memory afterwards.
Add some safety margin and work space for requests and try that as the
heap size.


*  We execute the following type of UPDATE operations:
 - There are triggers in the system (e.g. users of the application
changing the data) which start ~50 other update operations containing
up to ~30K triples. Most of them run in parallel, some are delayed
with seconds or minutes.
 - There are scheduled UPDATE operations (executed on hourly basis)
containing 30K-500K triples.
 - These UPDATE operations usually delete and insert the same amount
of triples in the dataset. We use the compact API as a nightly job.

*We are noticing the following behaviour:*

* Fuseki consumes 5-10G of heap memory continuously, as configured in
the JVM_ARGS.

* There are points in time when the volume usage of the k8s container
starts to increase suddenly. This does not drop even though compaction
is successfully executed and the dataset size (triple count) does not
increase. See attachment below.

*Our suspicions:*

* garbage collection in Java is often delayed; memory is not freed as
quickly as we would expect it, and the heap limit is reached quickly
if multiple parallel queries are run
* long running database queries can send regular memory to Gen2, that
is not actively cleaned by the garbage collector
* memory-mapped files are also garbage-collected (and perhaps they
could go to Gen2 as well, using more and more storage space).

Could you please explain the possible reasons behind such a behaviour?
And finally could you please suggest a more appropriate configuration
for our use case?

Thanks in advance and best wishes,
Gaspar Bartalus











Re: Requesting advice on Fuseki memory settings

2024-03-08 Thread Andy Seaborne

Hi Jan,

On 08/03/2024 12:31, Jan Eerdekens wrote:

In our data mesh use case we currently also have serious disk issues
because frequently removing/adding and updating data in a dataset seems to
increase the disk usage a lot. We're currently running frequent compact
calls, but especially on the larger datasets these have the tendency to
stall/not finish which eventually causes the system to run out of storage
(even though the actual amount of data is relatively small).


Is there anything in the log files to indicate what is causing the 
compactions to fail?


Jena 5.0.0 will have a more robust compaction step for Linux and MacOS 
(and native Windows eventually - but that is currently unreliable. Windows 
 deleting memory mapped files is a well-known, long-standing JDK issue)



In the beginning we also had some memory/GC issues, but after assigning
some more memory (we're at 12Gb now), tuning some GC parameters, switching
to SSD and adding some CPU capacity the GC issues seem to be under control.
We're currently also looking into configuring the disk to have more IOPS to
see if that can help with the compacting issues we're seeing now.


What size is your data?

What sort of storage class are you using for the database?

Andy



On Fri, 8 Mar 2024 at 11:40, Gaspar Bartalus  wrote:


Hi,

Thanks for the responses.

We were actually curious if you'd have some explanation for the
linear increase in the storage, and why we are seeing differences between
the actual size of our dataset and the size it uses on disk. (Changes
between `df -h` and `du -lh`)?

The heap memory has some very minimal peaks, saw-tooth, but otherwise it's
flat.

Regards,
Gaspar

On Thu, Mar 7, 2024 at 11:55 PM Andy Seaborne  wrote:




On 07/03/2024 13:24, Gaspar Bartalus wrote:

Dear Jena support team,

We would like to ask you to help us in configuring the memory for our
jena-fuseki instance running in kubernetes.

*We have the following setup:*

* Jena-fuseki deployed as StatefulSet to a k8s cluster with the
resource config:

Limits:
   cpu: 2
   memory:  16Gi
Requests:
   cpu: 100m
   memory:  11Gi

* The JVM_ARGS has the following value: -Xmx10G

* Our main dataset of type TDB2 contains ~1 million triples.

A million triples doesn't take up much RAM even in a memory dataset.

In Java, the JVM will grow until it is close to the -Xmx figure. A major
GC will then free up a lot of memory. But the JVM does not give the
memory back to the kernel.

TDB2 does not only use heap space. A heap of 2-4G is usually enough per
dataset, sometimes less (data shape dependent - e.g. many large
literals use more space).

Use a profiler to examine the heap in-use, you'll probably see a
saw-tooth shape.
Force a GC and see the level of in-use memory afterwards.
Add some safety margin and work space for requests and try that as the
heap size.


*  We execute the following type of UPDATE operations:
- There are triggers in the system (e.g. users of the application
changing the data) which start ~50 other update operations containing
up to ~30K triples. Most of them run in parallel, some are delayed
with seconds or minutes.
- There are scheduled UPDATE operations (executed on hourly basis)
containing 30K-500K triples.
- These UPDATE operations usually delete and insert the same amount
of triples in the dataset. We use the compact API as a nightly job.

*We are noticing the following behaviour:*

* Fuseki consumes 5-10G of heap memory continuously, as configured in
the JVM_ARGS.

* There are points in time when the volume usage of the k8s container
starts to increase suddenly. This does not drop even though compaction
is successfully executed and the dataset size (triple count) does not
increase. See attachment below.

*Our suspicions:*

* garbage collection in Java is often delayed; memory is not freed as
quickly as we would expect it, and the heap limit is reached quickly
if multiple parallel queries are run
* long running database queries can send regular memory to Gen2, that
is not actively cleaned by the garbage collector
* memory-mapped files are also garbage-collected (and perhaps they
could go to Gen2 as well, using more and more storage space).

Could you please explain the possible reasons behind such a behaviour?
And finally could you please suggest a more appropriate configuration
for our use case?

Thanks in advance and best wishes,
Gaspar Bartalus









Re: Requesting advice on Fuseki memory settings

2024-03-08 Thread Andy Seaborne




On 08/03/2024 10:40, Gaspar Bartalus wrote:

Hi,

Thanks for the responses.

We were actually curious if you'd have some explanation for the
linear increase in the storage, and why we are seeing differences between
the actual size of our dataset and the size it uses on disk. (Changes
between `df -h` and `du -lh`)?


Linear increase between compactions or across compactions? The latter 
sounds like the previous version hasn't been deleted.


TDB uses sparse files. It allocates 8M chunks per index but that isn't 
used immediately. Sparse files are reported differently by different 
tools and also differently by different operating systems. I don't know 
how k3s is managing the storage.


Sometimes it's the size of the file, sometimes it's the amount of space 
in use. For small databases, there is quite a difference.


An empty database is around 220kbytes but you'll see many 8Mbyte files 
with "ls -l".


If you zip the database up, and unpack it then it's 193Mbytes.

After a compaction, the previous version of storage can be deleted. The 
directory "Data-..." - only the highest numbered directory is used. A 
previous one can be zipped up for backup.
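
On Linux, GNU du can show the difference between blocks actually 
allocated and the apparent size of sparse files (paths are only 
illustrative):

   du -sh databases/DS/Data-0001                    # space actually allocated
   du -sh --apparent-size databases/DS/Data-0001    # file sizes, including sparse regions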



The heap memory has some very minimal peaks, saw-tooth, but otherwise it's
flat.


At what amount of memory?



Regards,
Gaspar

On Thu, Mar 7, 2024 at 11:55 PM Andy Seaborne  wrote:




On 07/03/2024 13:24, Gaspar Bartalus wrote:

Dear Jena support team,

We would like to ask you to help us in configuring the memory for our
jena-fuseki instance running in kubernetes.

*We have the following setup:*

* Jena-fuseki deployed as StatefulSet to a k8s cluster with the
resource config:

Limits:
   cpu: 2
   memory:  16Gi
Requests:
   cpu: 100m
   memory:  11Gi

* The JVM_ARGS has the following value: -Xmx10G

* Our main dataset of type TDB2 contains ~1 million triples.

A million triples doesn't take up much RAM even in a memory dataset.

In Java, the JVM will grow until it is close to the -Xmx figure. A major
GC will then free up a lot of memory. But the JVM does not give the
memory back to the kernel.

TDB2 does not only use heap space. A heap of 2-4G is usually enough per
dataset, sometimes less (data shape dependent - e.g. many large
literals use more space).

Use a profiler to examine the heap in-use, you'll probably see a
saw-tooth shape.
Force a GC and see the level of in-use memory afterwards.
Add some safety margin and work space for requests and try that as the
heap size.


*  We execute the following type of UPDATE operations:
- There are triggers in the system (e.g. users of the application
changing the data) which start ~50 other update operations containing
up to ~30K triples. Most of them run in parallel, some are delayed
with seconds or minutes.
- There are scheduled UPDATE operations (executed on hourly basis)
containing 30K-500K triples.
- These UPDATE operations usually delete and insert the same amount
of triples in the dataset. We use the compact API as a nightly job.

*We are noticing the following behaviour:*

* Fuseki consumes 5-10G of heap memory continuously, as configured in
the JVM_ARGS.

* There are points in time when the volume usage of the k8s container
starts to increase suddenly. This does not drop even though compaction
is successfully executed and the dataset size (triple count) does not
increase. See attachment below.

*Our suspicions:*

* garbage collection in Java is often delayed; memory is not freed as
quickly as we would expect it, and the heap limit is reached quickly
if multiple parallel queries are run
* long running database queries can send regular memory to Gen2, that
is not actively cleaned by the garbage collector
* memory-mapped files are also garbage-collected (and perhaps they
could go to Gen2 as well, using more and more storage space).

Could you please explain the possible reasons behind such a behaviour?
And finally could you please suggest a more appropriate configuration
for our use case?

Thanks in advance and best wishes,
Gaspar Bartalus







Re: Requesting advice on Fuseki memory settings

2024-03-07 Thread Andy Seaborne



On 07/03/2024 13:24, Gaspar Bartalus wrote:

Dear Jena support team,

We would like to ask you to help us in configuring the memory for our 
jena-fuseki instance running in kubernetes.


*We have the following setup:*

* Jena-fuseki deployed as StatefulSet to a k8s cluster with the 
resource config:


Limits:
  cpu:     2
  memory:  16Gi
Requests:
  cpu:     100m
  memory:  11Gi

* The JVM_ARGS has the following value: -Xmx10G

* Our main dataset of type TDB2 contains ~1 million triples.

A million triples doesn't take up much RAM even in a memory dataset.

In Java, the JVM will grow until it is close to the -Xmx figure. A major 
GC will then free up a lot of memory. But the JVM does not give the 
memory back to the kernel.


TDB2 does not only use heap space. A heap of 2-4G is usually enough per 
dataset, sometimes less (data shape dependent - e.g. many large 
literals use more space).


Use a profiler to examine the heap in-use, you'll probably see a 
saw-tooth shape.

Force a GC and see the level of in-use memory afterwards.
Add some safety margin and work space for requests and try that as the 
heap size.
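
For example, something along these lines (the pid and sizes are 
illustrative; JVM_ARGS is read by the fuseki-server script):

   # Force a full GC and look at the in-use heap afterwards
   jcmd <fuseki-pid> GC.run
   jcmd <fuseki-pid> GC.heap_info

   # Then restart with a smaller heap plus some headroom
   JVM_ARGS="-Xmx4G" ./fuseki-server --loc=DB2 /ds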



*  We execute the following type of UPDATE operations:
   - There are triggers in the system (e.g. users of the application 
changing the data) which start ~50 other update operations containing 
up to ~30K triples. Most of them run in parallel, some are delayed 
with seconds or minutes.
   - There are scheduled UPDATE operations (executed on hourly basis) 
containing 30K-500K triples.
   - These UPDATE operations usually delete and insert the same amount 
of triples in the dataset. We use the compact API as a nightly job.


*We are noticing the following behaviour:*

* Fuseki consumes 5-10G of heap memory continuously, as configured in 
the JVM_ARGS.


* There are points in time when the volume usage of the k8s container 
starts to increase suddenly. This does not drop even though compaction 
is successfully executed and the dataset size (triple count) does not 
increase. See attachment below.


*Our suspicions:*

* garbage collection in Java is often delayed; memory is not freed as 
quickly as we would expect it, and the heap limit is reached quickly 
if multiple parallel queries are run
* long running database queries can send regular memory to Gen2, that 
is not actively cleaned by the garbage collector
* memory-mapped files are also garbage-collected (and perhaps they 
could go to Gen2 as well, using more and more storage space).


Could you please explain the possible reasons behind such a behaviour?
And finally could you please suggest a more appropriate configuration 
for our use case?


Thanks in advance and best wishes,
Gaspar Bartalus



Re: RDFRemoteConnection to Shiro protected Fuseki

2024-02-17 Thread Andy Seaborne



On 17/02/2024 11:16, Bart van Leeuwen wrote:

Hi,

Forget the whole report, I messed up my shiro config, it works as 
expected with the snippet below.

Good to hear that!

    Andy




Bart

On 2024/02/16 13:50:57 Bart van Leeuwen wrote:
> Hi Andy,
>
> Stand alone example I'll try to work on that.
> This is an app that runs in Apache Tomee
>
> the snippet how I setup the authentication:
>
> AuthEnv.get().registerBasicAuthModifier(provider.getEndpoint(), provider
> .getUser(), provider.getPassword());
>
>     builder = RDFConnectionFuseki.create()
>         .destination(provider.getEndpoint());
>     conn = builder.build();
>
> provider is an internal class that gives me the information I need (all
> double checked to be correct)
>
> the shiro line I use:
>
> /ristore/** = authcBasic,user[admin]
>
> this works from the web UI without issues
>
> Met Vriendelijke Groet / With Kind Regards
> Bart van Leeuwen
>
> On 2024/02/16 10:49:38 Andy Seaborne wrote:
> > Hi Bart,
> >
> > Do you have a complete, ideally runnable, example of how you are 
using

> > RDFConnection and also the client side auth setup.
> >
> >      Andy
> >
> > On 15/02/2024 19:27, Bart van Leeuwen wrote:
> > > Hi,
> > >
> > > I'm runn
Met Vriendelijke Groet / With Kind Regards
Bart van Leeuwen


mastodon: @semanticfire@mastodon.social
tel. +31(0)6-53182997
Netage B.V.
http://netage.nl <http://netage.nl/>
Esdoornstraat 3
3461ER Linschoten
The Netherlands


Re: RDFRemoteConnection to Shiro protected Fuseki

2024-02-16 Thread Andy Seaborne

Hi Bart,

Do you have a complete, ideally runnable, example of how you are using 
RDFConnection and also the client side auth setup.


Andy

On 15/02/2024 19:27, Bart van Leeuwen wrote:

Hi,

I'm running Fuseki 4.9.0. on linux with OpenJDK 17
I've protected it with the shiro configuration and that works without
issues for the web UI.

When I try to connect to the server with RDFConnectionRemoteBuilder or
RDFConnectionFuseki
I get:
Caused by: java.io.IOException: WWW-Authenticate header missing for
response code 401

I've tried all the variations described in:
https://jena.apache.org/documentation/sparql-apis/http-auth.html

but to no avail.

Met Vriendelijke Groet / With Kind Regards
Bart van Leeuwen



[ANN] Apache Jena 5.0.0-rc1

2024-02-14 Thread Andy Seaborne

The Apache Jena development community is pleased to
announce the release of Apache Jena 5.0.0-rc1

In Jena5:

* Minimum: Java 17
* Language tags are case-insensitive unique.
* Term graphs for in-memory models
* RRX - New RDF/XML parser
* Remove support for JSON-LD 1.0
* Turtle/Trig Output : default output PREFIX and BASE
* Artifacts : jena-bom and OWASP CycloneDX SBOM
* API deprecation removal
* Dependency updates : slf4j update : v1 to v2 (needs log4j change)

More details below.

There is no further feature work planned for Jena 5.0.0. This RC release 
is for wider review. The review period will be about a month.


 Contributions:

Balduin Landolt @BalduinLandolt - javadoc fix for Literal.getString.

@OyvindLGjesdal - https://github.com/apache/jena/pull/2121 -- text index fix

Paul Gallagher @TelicentPaul - Code cleanup

Tong Wang @wang3820 Fix tests due to hashmap order



All issues in this release:
https://s.apache.org/jena-5.0.0-rc1-issues

which includes the ones specifically related to Jena5:

  https://github.com/apache/jena/issues?q=label%3Ajena5

** Java Requirement

Java 17 or later is required.
Java 17 language constructs now are used in the codebase.

Jakarta JavaEE required for deploying the WAR file (Apache Tomcat10)

** Language tags

Language tags are now case-insensitively unique.

"abc"@EN and "abc"@en are the same RDF term.

Internally, language tags are formatted using the algorithm of RFC 5646.

Examples "@en", "@en-GB", "@en-Latn-GB".

SPARQL LANG(?literal) will return a formatted language tag.

Data stored in TDB using language tags must be reloaded.
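
A small illustration of the change, following the description above:

   ASK { FILTER ( "chat"@EN = "chat"@en ) }   # true in Jena5: same RDF term
   # LANG() returns the formatted tag, e.g. LANG("chat"@EN) gives "en"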

** Term graphs

Graphs are now term graphs in the API or SPARQL. That is, they do not 
match "same value" for some of the java mapped datatypes. The model API 
already normalizes values written.


TDB1, TDB2 keep their value canonicalization during data loading.

A legacy value-graph implementation can be obtained from GraphMemFactory.

** RRX - New RDF/XML parser

RRX is the default RDF/XML parser. It is a replacement for ARP.
RIOT uses RRX.

The ARP parser is still temporarily available for transition assistance.

** Remove support for JSON-LD 1.0

JSON-LD 1.1, using Titanium-JSON-LD, is the supported version of JSON-LD.

https://github.com/filip26/titanium-json-ld

** Turtle/Trig Output

"PREFIX" and "BASE" are output by default for Turtle and TriG output.

** Artifacts

There is now a release BOM for Jena artifacts - artifact 
org.apache.jena:jena-bom


There are now OWASP CycloneDX SBOM for Jena artifacts.
https://github.com/CycloneDX

jena-tdb is renamed jena-tdb1.

jena-jdbc is no longer released

** Dependencies

The update to slf4j 2.x means the log4j artifact changes to
"log4j-slf4j2-impl" (was "log4j-slf4j-impl").


 API Users

** Deprecation removal

There has been a clearing out of deprecated functions, methods and 
classes. This includes the deprecations in Jena 4.10.0 added to show 
code that is being removed in Jena5.


** QueryExecutionFactory

QueryExecutionFactory is simplified to cover common cases only; it 
becomes a way to call the general QueryExecution builders, which are 
preferred and provide the full query execution setup controls.


Local execution builder:
QueryExecution.create()...

Remote execution builder:
QueryExecution.service(URL)...

** QueryExecution variable substitution

Using "substitution", where the query is modified by replacing one or 
more variables by RDF terms, is now preferred to using "initial 
bindings", where query solutions include (var,value) pairs.


"substitution" is available for all queries, local and remote, not just 
local executions.
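
A rough sketch of the two builder styles with substitution (the endpoint 
URL is illustrative, and the builder method names are from memory, so 
check them against the javadoc):

   import org.apache.jena.query.*;
   import org.apache.jena.rdf.model.*;

   public class SubstitutionExample {
       public static void main(String[] args) {
           Model model = ModelFactory.createDefaultModel();
           Dataset dataset = DatasetFactory.create(model);
           Query query = QueryFactory.create("SELECT * { ?s ?p ?o }");
           Resource s = model.createResource("http://example/s");

           // Local execution: replace ?s by a concrete term before running
           try (QueryExecution qExec = QueryExecution.dataset(dataset)
                   .query(query)
                   .substitution("s", s)
                   .build()) {
               ResultSetFormatter.out(qExec.execSelect());
           }

           // Remote execution with the same substitution
           try (QueryExecution qExec = QueryExecution.service("http://localhost:3030/ds/query")
                   .query(query)
                   .substitution("s", s)
                   .build()) {
               ResultSetFormatter.out(qExec.execSelect());
           }
       }
   }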


Rename TDB1 packages org.apache.jena.tdb -> org.apache.jena.tdb1


 Fuseki Users

Fuseki: Uses the Jakarta namespace for servlets and Fuseki has been 
upgraded to use Eclipse Jetty12.


Apache Tomcat10 or later, is required for running the WAR file.
Tomcat 9 or earlier will not work.


== Obtaining Apache Jena 5.0.0-rc1

* Via central.maven.org

The main jars and their dependencies can used with:

  
<dependency>
  <groupId>org.apache.jena</groupId>
  <artifactId>apache-jena-libs</artifactId>
  <type>pom</type>
  <version>5.0.0-rc1</version>
</dependency>
  

Full details of all maven artifacts are described at:

http://jena.apache.org/download/maven.html

* As binary downloads

Apache Jena libraries are available as a binary distribution of
libraries. For details of a global mirror copy of Jena binaries please see:

http://jena.apache.org/download/

* Source code for the release

The signed source code of this release is available at:

http://www.apache.org/dist/jena/source/

and the signed master source for all Apache Jena releases is available
at: http://archive.apache.org/dist/jena/

== Contributing

If you would like to help out, a good place to look is the list of
unresolved JIRA at:

https://github.com/apache/jena/issues

or review pull requests at

https://github.com/apache/jena/pulls

or drop into the dev@ list.

We use github pull requests and 

Re: Database Migrations in Fuseki

2024-02-09 Thread Andy Seaborne

Hi Balduin,

On 07/02/2024 11:05, Balduin Landolt wrote:

Hi everyone,

we're storing data in Fuseki as a persistence for our application backend,
the data is structured according to the application logic. Whenever
something changes in our application logic, we have to do a database
migration, so that the data conforms to the updated model.
Our current solution to that is very home-spun, not exactly stable and
comes with a lot of downtime, so we try to avoid it whenever possible.


If I understand correctly, this is a schema change requiring the data to 
change.


The transformation of the data to the updated data model could be done 
offline, that would reduce downtime. If the data is being continuously 
updated, that's harder because the offline copy will get out of step 
with the live data.


How often does the data change (not due to application logic changes)?


I'm now looking into how this could be improved in the future. My double
question is:
1) is there any tooling I missed, to help with this process? (In SQL world
for example, there are out of the box solutions for that.)
2) and if not, more broadly, does anyone have any hints on how I could best
go about this?


Do you have a concrete example of such a change? Maybe change-in-place 
is possible, but that depends on how updates happen and how the data 
feeds change with the application logic change.


Andy



Thanks in advance!
Balduin






Re: jena-fuseki UI in podman execution (2nd effort without attachments)

2024-02-09 Thread Andy Seaborne

Hi Jaana,

Glad you got it sorted out.

The Fuseki UI does not do anything special about browser caches. There 
was a major UI update with implementing it in Vue and all the HTML 
assets that go with that.


Andy

On 09/02/2024 05:37, jaa...@kolumbus.fi wrote:

Hi, I just noticed that it's not a question about podman or docker but about 
browser cache. After deleting everything in browser cache I managed to get the 
correct user interface when running stain/jena-fuseki:3.14.0 and 
stain/jena-fuseki:4.0.0 by both podman and docker, but when I tried the latest 
stain/jena-fuseki (4.8.0) I got the incorrect interface (shown here 
https://github.com/jamietti/jena/blob/main/fuseki-podman.png).

Jaana M



08.02.2024 13.23 EET jaa...@kolumbus.fi kirjoitti:

  
Hi, I've running jena-fuseki with docker:
  
docker run -p 3030:3030 -e ADMIN_PASSWORD=pw123 stain/jena-fuseki
  
and rootless podman:
  
podman run -p 3030:3030 -e ADMIN_PASSWORD=pw123 docker.io/stain/jena-fuseki
  
when executing the same version 4.8.0 of jena-fuseki with podman, the UI looks totally different from the UI of the instance executed with docker.
  
see file fuseki-podman.png https://github.com/jamietti/jena/blob/main/fuseki-podman.png in https://github.com/jamietti/jena/

What can cause this problem ?
  
Br, Jaana M


Re: Restart during Fuseki compaction

2024-02-07 Thread Andy Seaborne

Recorded as https://github.com/apache/jena/issues/2254

On 06/02/2024 23:06, Andy Seaborne wrote:

Hi Samuel,

This is when the server exits for some reason?

(If it's an internal exception, there should be a stack trace in the log 
file.)


What operating system are you running on?

What's in the new Data-0002 directory?

It does look like some defensive measures are needed to not choose to 
use the incomplete storage directory.


     Andy


On 06/02/2024 09:26, Samuel Börlin wrote:

Hi everybody,

I recently noticed that when Fuseki (4.10.0) is stopped during a 
compaction task (started via the HTTP endpoint 
`/$/compact/{name}?deleteOld=true`)
then it uses the new and still incomplete database (e.g. Data-0002 
instead of the original non-compacted Data-0001) when it is started 
again.
Is there a way to do compaction in an atomic manner so that this 
doesn't happen?


As a workaround I'm currently thinking about simply deleting (or 
perhaps renaming/moving) all Data- directories but the one with 
the lowest index when the database is started.
I always use `?deleteOld=true`, so I only ever expect there to be one 
Data- directory when it starts. If there are multiple directories 
then that means that there must have been an incomplete compaction.

Does this seem like a reasonable approach?

Thanks and best regards,
Samuel


Re: Restart during Fuseki compaction

2024-02-06 Thread Andy Seaborne

Hi Samuel,

This is when the server exits for some reason?

(If it's an internal exception, there should be a stack trace in the log 
file.)


What operating system are you running on?

What's in the new Data-0002 directory?

It does look like some defensive measures are needed to not choose to 
use the incomplete storage directory.


Andy


On 06/02/2024 09:26, Samuel Börlin wrote:

Hi everybody,

I recently noticed that when Fuseki (4.10.0) is stopped during a compaction 
task (started via the HTTP endpoint `/$/compact/{name}?deleteOld=true`)
then it uses the new and still incomplete database (e.g. Data-0002 instead of 
the original non-compacted Data-0001) when it is started again.
Is there a way to do compaction in an atomic manner so that this doesn't happen?

As a workaround I'm currently thinking about simply deleting (or perhaps 
renaming/moving) all Data- directories but the one with the lowest index 
when the database is started.
I always use `?deleteOld=true`, so I only ever expect there to be one Data- 
directory when it starts. If there are multiple directories then that means 
that there must have been an incomplete compaction.
Does this seem like a reasonable approach?

Thanks and best regards,
Samuel


Re: question about FROM keyword

2024-02-05 Thread Andy Seaborne

This is a combination of things happening.

In the one case of no data (graph or dataset) provided, Jena does read 
the URL. If there is supplied data, FROM refers to the dataset.


The URL is coming back from www.learningsparql.com
as explicitly "Content-Type: text/plain", not "text/turtle".

Jena pretty much ignores "text/plain" because it is usually wrong, so it 
tries to guess the syntax.


The URL in the message

  (URI=file:///D:/neli/cs575Spring24/ex070mod2.rq : stream=text/plain)

is misleading - that "URI" is the base URI, not the URI being read.

> (This specifically may be a bug in the arq tool)

Yes, it is.

Recorded as https://github.com/apache/jena/issues/2250

Corrected, the results are:

-
| last   | first | courseName   |
=
| "Mutt" | "Richard" | "Updating Data with SPARQL"  |
| "Mutt" | "Richard" | "Using SPARQL with non-RDF Data" |
| "Marshall" | "Cindy"   | "Modeling Data with OWL" |
| "Marshall" | "Cindy"   | "Using SPARQL with non-RDF Data" |
| "Ellis"| "Craig"   | "Using SPARQL with non-RDF Data" |
-

Workarounds:
1/ Download the file using curl or wget as suggested
2/ Set the base on the command line with
   --base http://www.learningsparql.com/2ndeditionexamples/ex069.ttl
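
For example (filenames are illustrative; for workaround 1 the FROM 
clause would be removed from the query):

   # Workaround 1: fetch the data locally and query it directly
   curl -L -o ex069.ttl 'http://www.learningsparql.com/2ndeditionexamples/ex069.ttl'
   arq --data ex069.ttl --query ex070mod2.rq

   # Workaround 2: keep FROM and set the base explicitly
   arq --query ex070mod2.rq --base 'http://www.learningsparql.com/2ndeditionexamples/ex069.ttl'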


The message

ERROR StatusLogger Reconfiguration failed: No configuration found for 
'73d16e93' at 'null' in 'null'


is unrelated.

It is the command not finding the logging set up - I don't know why that 
is happening.


Try copying the log4j2.properties from the distribution directory into 
the current directory.


Andy

On 05/02/2024 13:06, Zlatareva, Neli (Computer Science) wrote:

Hi Rob, thank you so much for the quick response. What made me wonder was that 
this same FROM from arq on command line worked perfectly fine in the past (was 
able to access remote files). However, I assume that for different reasons 
(security?) this is not the case anymore.
Truly appreciate the help.
Thanks.
Regards, Neli.

Neli P. Zlatareva, PhD
Professor of Computer Science
Department of Computer Science
Central Connecticut State University
New Britain, CT 06050
Phone: (860) 832-2723
Fax: (860) 832-2712
Web site: cs.ccsu.edu/~neli/

From: Rob @ DNR 
Sent: Monday, February 5, 2024 6:32 AM
To: users@jena.apache.org 
Subject: Re: question about FROM keyword

EXTERNAL EMAIL: This email originated from outside of the organization. Do not 
click any links or open any attachments unless you trust the sender and know 
the content is safe.

So, there’s a couple of things happening here.

Firstly, Jena’s SPARQL engine always treats FROM (and FROM NAMED) as referring 
to graphs in the local dataset.  So, it doesn’t matter that the URL in your 
FROM is a valid RDF resource on the web, Jena won’t try and load that by 
default, it just looks for a graph with that URI in the local dataset.

Nothing in the SPARQL specifications requires that these URLs be treated 
otherwise.  Some implementations choose to resolve these URIs from the web but 
that isn’t required by the standard, and from a security standpoint isn’t a 
good idea.

Secondly, the ARQ command line tool the local dataset is usually an implicit 
empty dataset if you don’t supply one.  Except as it turns out when you supply 
a FROM/FROM NAMED, in which case it tries to build one given the inputs it has. 
 In this case that’s only your query file which isn’t valid when treated as an 
RDF dataset, thus you get the big nasty stack trace you reported.  (This 
specifically may be a bug in the arq tool)

You can avoid this second problem by supplying an empty data file e.g.

  arq --query query.rq --data empty.ttl

But that will only serve to highlight the first issue, that Jena only treats 
FROM/FROM NAMED as references to graphs in the local dataset, and you’ll get an 
empty result from your query.

You are better off downloading the RDF data you want to query locally and then 
running arq and supplying both a query file and a data file.

Hope this helps,

Rob

From: Zlatareva, Neli (Computer Science) 
Date: Monday, 5 February 2024 at 01:40
To: users@jena.apache.org 
Subject: question about FROM keyword
Hi there, I am trying the following arq query from command window
(works fine if I am getting the file locally)

PREFIX ab: 

Re: ARQInternalErrorException during query execution in Jena 4.10.0

2024-01-04 Thread Andy Seaborne

https://github.com/apache/jena/discussions/2150

The query shows is not the one generated by the update builder.



Re: ARQInternalErrorException during query execution in Jena 4.10.0

2024-01-04 Thread Andy Seaborne




On 03/01/2024 20:58, Dhamotharan, Kishan wrote:

Hi all,

We are attempting to upgrade from Jena 3.5 to Jena 4.10.0.
We are using  “RDFConnection.connect(TDBFactory.createDataset());” for unit 
tests.
The below query works totally fine in Jena 3.5 but fails with the following 
exception in Jena 4.10.0.
I have confirmed that the query is correct and works totally fine in Neptune 
RDF as well. Can you please help us on how to go about this ? or please suggest 
if the query needs to updated to something else for jena 4.10.0.


If the update request works, try that exact string locally.

If that works, try converting the output of the UpdateBuilder to a 
string, and parsing it back:


updateRequest = UpdateFactory.create(updateRequest.toString());
update(conn, updateRequest);

If that works, then there is a problem in the UpdateBuilder.
Whether that is in the way it is being used or a bug in the 
UpdateBuilder itself isn't clear.


Reduce the test case to a simpler update.

> from Jena 3.5 to Jena 4.10.0.

It would helpful if you could bisect on the versions to identify which 
version introduced the problem.



I have also attached the code sample to reproduce the issue.


The code does not compile. Is it an extract of Groovy?

There is missing code and multiple syntax errors. It is very helpful to 
have code that runs exactly without needing to be fixed up because, in 
fixing it up, some assumption may be made that relates to the problem at 
hand.


One example:

> private final graph1 = getNamedGraph()

Bad Java syntax 1. no ";"  Is this because it's Groovy?
Has other text also been lost? Groovy may be returning a bad choice of type.

Bad Java syntax 2.  No type declaration - inserting bad data into a 
builder can make it fail.


What's "getNamedGraph()"?

Ditto createURI()



Query :

INSERT {
   GRAPH 
 {
   "o1" .
   }
 }
 WHERE
   { GRAPH 

   { }
 GRAPH 

   { FILTER NOT EXISTS { "o3" }}
   }

Error :

 org.apache.jena.sparql.ARQInternalErrorException: compile(Element)/Not a 
structural element: ElementFilter
 at 
app//org.apache.jena.sparql.algebra.AlgebraGenerator.broken(AlgebraGenerator.java:577)
 at 
app//org.apache.jena.sparql.algebra.AlgebraGenerator.compileUnknownElement(AlgebraGenerator.java:170)
 at 
app//org.apache.jena.sparql.algebra.AlgebraGenerator.compileElement(AlgebraGenerator.java:156)
 at 
app//org.apache.jena.sparql.algebra.AlgebraGenerator.compileElementGraph(AlgebraGenerator.java:426)
 at 
app//org.apache.jena.sparql.algebra.AlgebraGenerator.compileElement(AlgebraGenerator.java:133)
 at 
app//org.apache.jena.sparql.algebra.AlgebraGenerator.compileOneInGroup(AlgebraGenerator.java:319)
 at 
app//org.apache.jena.sparql.algebra.AlgebraGenerator.compileElementGroup(AlgebraGenerator.java:202)
 at 
app//org.apache.jena.sparql.algebra.AlgebraGenerator.compileElement(AlgebraGenerator.java:127)
 at 
app//org.apache.jena.sparql.algebra.AlgebraGenerator.compile(AlgebraGenerator.java:113)
 at 
app//org.apache.jena.sparql.algebra.AlgebraGenerator.compile(AlgebraGenerator.java:100)
 at app//org.apache.jena.sparql.algebra.Algebra.compile(Algebra.java:73)
 at 
app//org.apache.jena.sparql.engine.QueryEngineBase.createOp(QueryEngineBase.java:140)
 at 
app//org.apache.jena.sparql.engine.QueryEngineBase.(QueryEngineBase.java:57)
 at 
app//org.apache.jena.sparql.engine.main.QueryEngineMain.(QueryEngineMain.java:45)
 at 
app//org.apache.jena.tdb.solver.QueryEngineTDB.(QueryEngineTDB.java:63)
 at 
app//org.apache.jena.tdb.solver.QueryEngineTDB$QueryEngineFactoryTDB.create(QueryEngineTDB.java:135)
 at 
app//org.apache.jena.query.QueryExecutionFactory.makePlan(QueryExecutionFactory.java:442)
 at 
app//org.apache.jena.query.QueryExecutionFactory.createPlan(QueryExecutionFactory.java:418)
 at 
app//org.apache.jena.sparql.modify.UpdateEngineWorker.evalBindings(UpdateEngineWorker.java:532)
 at 
app//org.apache.jena.sparql.modify.UpdateEngineWorker.visit(UpdateEngineWorker.java:371)
 at 
app//org.apache.jena.sparql.modify.request.UpdateModify.visit(UpdateModify.java:100)
 at 
app//org.apache.jena.sparql.modify.UpdateVisitorSink.send(UpdateVisitorSink.java:45)
 at 
app//org.apache.jena.sparql.modify.UpdateVisitorSink.send(UpdateVisitorSink.java:31)
 at 
java.base@17.0.9/java.util.ArrayList$Itr.forEachRemaining(ArrayList.java:1003)
 at 
java.base@17.0.9/java.util.Collections$UnmodifiableCollection$1.forEachRemaining(Collections.java:1061)
 at 

Jena5: what to expect

2023-12-30 Thread Andy Seaborne

Jena5 is the next planned release for Apache Jena.

** All issues for Jena5:

https://github.com/apache/jena/issues?q=is%3Aissue+label%3AJena5

** Java Requirement

Java 17 or later is required.
Java 17 language constructs now are used in the codebase.

** Language tags

Language tags are now case-insensitively unique.

"abc"@EN and "abc"@en are the same RDF term.

Internally, language tags are formatted using the algorithm of RFC 5646.

Examples "@en", "@en-GB", "@en-Latn-GB".

SPARQL LANG(?literal) will return a formatted language tag.

Data stored in TDB using language tags must be reloaded.

** Term graphs

The default in-memory graphs become term graphs for consistency across 
all Jena storage options; they do not match "same value" for some of the 
java mapped datatypes, e.g. int 1 does not match "001"^^xsd:int. The model 
API has always normalized values written, e.g. "1"^^xsd:int.


TDB1, TDB2 keep their value canonicalization during data loading.

A legacy value-graph implementation can be obtained from GraphMemFactory.

** RRX - New RDF/XML parser

RRX is a new RDF/XML parser. It is a replacement for ARP and will be the 
default.


Differences to ARP:
  * daml:collection is not supported.
  * Strict rdf:parseType
  * Relative namespaces supported.

The ARP parser will, temporarily, still be available for any
transition assistance.

** Remove support for JSON-LD 1.0

JSON-LD 1.1, using Titanium-JSON-LD, is the supported version of JSON-LD.

https://github.com/filip26/titanium-json-ld

** Turtle/Trig Output

The "PREFIX" and "BASE" forms are output by default for Turtle and TriG 
output. See RIOT.symTurtleDirectiveStyle.



 API Users

** Deprecation removal

There has been a general clearing out of deprecated functions, methods 
and classes. This includes deprecations in Jena 4.10.0 added to show 
code that is being removed in Jena5.


** QueryExecutionFactory

QueryExecutionFactory is simplified to cover common cases only; it 
becomes a way to call the more general QueryExecution builders which 
support custom query execution setup.


Local execution builder:
  QueryExecution.create()...

Remote execution builder:
  QueryExecution.service(URL)...

** QueryExecution variable substitution

Using "substitution", where the query is modified by replacing one or 
more variables by RDF terms, is now preferred to using "initial 
bindings", where query solutions include (var,value) pairs.


"substitution" is available for all queries, local and remote, not just 
local executions.



 Fuseki Users

Fuseki: Uses the jakarta namespace for servlets and Fuseki has been 
upgraded to use Eclipse Jetty12.


Apache Tomcat10 or later, is required for running the WAR file.
Tomcat 9 or earlier will not work.


Re: Parallel requests on multiple fuseki

2023-12-14 Thread Andy Seaborne

Jorge,

Have you looked at

https://jena.apache.org/documentation/query/service_enhancer.html

It might have features of use to you.
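
As a client-side alternative (a sketch only - the endpoint URLs are 
hypothetical), the two requests can also be issued in parallel from Java 
and the results merged by the caller:

   import java.util.List;
   import java.util.concurrent.CompletableFuture;
   import org.apache.jena.query.ResultSetFormatter;
   import org.apache.jena.sparql.exec.http.QueryExecutionHTTP;

   public class ParallelQueries {
       public static void main(String[] args) {
           List<String> endpoints = List.of(
                   "http://host1:3030/ds/query",   // hypothetical endpoints
                   "http://host2:3030/ds/query");
           String query = "SELECT ?anything WHERE { ?anything a ?bb }";

           // Fire both requests concurrently, then print/merge the results
           List<CompletableFuture<String>> futures = endpoints.stream()
                   .map(url -> CompletableFuture.supplyAsync(() -> {
                       try (var qExec = QueryExecutionHTTP.service(url).query(query).build()) {
                           return ResultSetFormatter.asText(qExec.execSelect());
                       }
                   }))
                   .toList();

           futures.forEach(f -> System.out.println(f.join()));
       }
   }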

Andy

On 14/12/2023 08:25, George News wrote:

Hi,

I have deployed several Fuseki instances. This email scenario is just 
for 2.


I was testing the SERVICE option in order to launch the same request to
both instances of Fuseki and merge the result under one response.

The SPARQL request I launched is the following:

prefix rdf: 
SELECT * WHERE {
   { SERVICE 
     {
   SELECT ?anything WHERE{?anything rdf:type ?bb}
     } BIND ( AS ?serviceLabel)
   }
   UNION
   { SERVICE 
  {
   SELECT ?anything WHERE{?anything rdf:type ?bb}
  } BIND ( AS ?serviceLabel)
   }
}

The result was the expected. However when using Wireshark and analysing
the logs, I noticed that the request are not send in parallel, but just
one and then the other. This is somehow a waste of time ;)

Is there any way to parallelize sending the same request to many Fuseki
instances and merge the responses? I guess I can make my own solution
using Jena, but I wanted to know if it would be possible using SPARQL.

Thanks.
Jorge


Re: Checking that SPARQL Update will not validate SHACL constraints

2023-12-13 Thread Andy Seaborne




On 13/12/2023 15:49, Arne Bernhardt wrote:

Hello Martynas,

I have no experience with implementing a validation layer for Fuseki.

But I might have an idea for your suggested approach:
Instead of loading a copy of the graph and modifying it, you could create
an org.apache.jena.graph.compose.Delta based on the unmodified graph.
Then apply the update to the delta graph and validate the SHACL on the
delta graph. If the validation is successful, you can safely apply the
update to the original graph and discard the delta graph.

You still have to deal with concurrency. For example, the original graph
could be changed by a second, faster update while you are still validating
the first update. It would not be safe to apply the validated changes to a
graph that has been changed in the meantime.

Arne


It all depends on the SHACL. Many constraints don't need all the data 
available. Some need just the subject and all properties (e.g. 
sh:maxCount). Some need all the data (SPARQL ones - they are opaque to 
analysis so the general way is they need all the data).


If the proxy layer is same JVM, BufferingDatasetGraph may help.
It can be used to capture the adds and deletes. It can then be validated 
(all data or only the data changing). Flush the changes to the database 
just before the end of the request in the proxy level commit.
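
The validation step itself is small. A minimal sketch, assuming the 
shapes are already available as a Graph and dataGraph holds the would-be 
result of the update:

   import org.apache.jena.graph.Graph;
   import org.apache.jena.shacl.ShaclValidator;
   import org.apache.jena.shacl.Shapes;
   import org.apache.jena.shacl.ValidationReport;
   import org.apache.jena.shacl.lib.ShLib;

   public class ShaclCheck {
       /** Validate a candidate graph against SHACL shapes before committing it. */
       public static boolean passesShacl(Graph shapesGraph, Graph dataGraph) {
           Shapes shapes = Shapes.parse(shapesGraph);
           ValidationReport report = ShaclValidator.get().validate(shapes, dataGraph);
           if (!report.conforms()) {
               ShLib.printReport(report);   // log the violations
           }
           return report.conforms();
       }
   }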


If the proxy is in a different JVM, then only certain constraints can be 
supported but they do tend to be the most common checks.


Andy






Am Mi., 13. Dez. 2023 um 14:29 Uhr schrieb Martynas Jusevičius <
marty...@atomgraph.com>:


Hi,

I have an objective to only persist constraint-validated data in Fuseki.

I have a proxy layer that validates all incoming GSP PUT and POST
request graphs in memory and rejects the invalid ones. So far so good.

What about SPARQL Update requests though? For simplicity's sake, let's
say they are restricted to a single graph as in GSP PATCH [1].
What I can think of is first loading the graph into memory and
executing the update, and then validating the resulting graph against
SHACL. But maybe there's a smarter way?

Also interested in the more general case without the graph restriction.

Martynas

[1] https://www.w3.org/TR/sparql11-http-rdf-update/#http-patch





Re: Unable to build the below query using jena query builder

2023-12-08 Thread Andy Seaborne




On 08/12/2023 02:08, Dhamotharan, Kishan wrote:

Hello Lorenz,

Thanks for your response.


...



Since query builder 3.5 does not have addWhereValueVar is there any other way 
to build the query ?

It’s a very painful process to pull in third-party / open source libraries, 
requires multiple approvals and adding a new version would involve a very 
tedious task of manually upgrading and pulling in the dependences and get them 
to work with the in house build system. Would be great if we have a workaround 
for this.


If you're unwilling to upgrade (and the 6 year old 3.5.0 has CVE issues 
raised against it, so upgrading would be a very good idea) then you could 
consider taking the query builder source code. It is a self-contained 
feature of Apache Jena and should back-port quite easily.


Andy


Re: Problem running AtomGraph/fuseki-docker

2023-12-07 Thread Andy Seaborne

[2023-12-06 22:19:53] INFO  Server  :: Path = /'/ds'

Not good. Shell quoting didn't happen. That's a URL path component 
called '/ds' in the server root.


Andy

On 06/12/2023 23:55, Steve Vestal wrote:

I was using bash.  When I run it in command prompt, it works. Thanks!

Interestingly, when the command prompt is closed, the container is 
removed from Docker Desktop.  Each new start creates a new container 
with a new amusing name :-)


C:\Users\svestal>docker run --rm -p 3030:3030 atomgraph/fuseki --mem '/ds'
[2023-12-06 22:19:53] INFO  Server  :: Apache Jena Fuseki 4.6.1
[2023-12-06 22:19:53] INFO  Server  :: Database: in-memory
[2023-12-06 22:19:53] INFO  Server  :: Path = /'/ds'
[2023-12-06 22:19:53] INFO  Server  :: System
[2023-12-06 22:19:53] INFO  Server  ::   Memory: 2.0 GiB
[2023-12-06 22:19:53] INFO  Server  ::   Java:   17-ea
[2023-12-06 22:19:53] INFO  Server  ::   OS: Linux 
5.15.133.1-microsoft-standard-WSL2 amd64

[2023-12-06 22:19:53] INFO  Server  ::   PID:    1
[2023-12-06 22:19:53] INFO  Server  :: Start Fuseki (http=3030)

On 12/6/2023 2:12 PM, Martynas Jusevičius wrote:

Hi Steve,

This looks like Windows shell issue.

For some reason /ds is resolved as a filepath where it shouldn’t.

Can you try —mem '/ds' with quotes?

I’m running Docker on WSL2 and never had this problem.

Martynas

On Wed, 6 Dec 2023 at 21.05, Steve Vestal  
wrote:



I am running a VM with Microsoft Windows Server 2019 (64-bit). When I
try to stand up the docker server, I get

$ docker run --rm -p 3030:3030 atomgraph/fuseki --mem /ds
String '/C:/Program Files/Git/ds' not valid as 'service'

Suggestions?




Re: Text indexing stopped working

2023-11-30 Thread Andy Seaborne

There isn't much information to go.

On 29/11/2023 09:50, Mikael Pesonen wrote:

No idea?

On 16/11/2023 13.11, Mikael Pesonen wrote:
What could be the reason why new data is suddenly not added to text 
index and not found with Jena text queries?


The newest files in Jena text index folder are zero sized

_b_Lucene85FieldsIndexfile_pointers_n.tmp
_b_Lucene85FieldsIndex-doc_ids_m.tmp

dated 2023-11-13 although I have added lots of data since then using 
same methods as before. Text queries find all the data before this date.


BR




Re: Querying URL with square brackets

2023-11-25 Thread Andy Seaborne




On 25/11/2023 13:47, Marco Neumann wrote:

I was looking for an IRI validator and this one didn't come up in the
search engines. This service might need a bit more visibility and some
incoming links.


It gets lost in all the code library "validators"



Marco

On Sat, Nov 25, 2023 at 1:34 PM Andy Seaborne  wrote:




On 24/11/2023 10:05, Marco Neumann wrote:

(side note) preferably the local name of a URI should not start with a
number but a letter or underscore.


It's a hangover from XML QNames.

Turtle doesn't care.

Style-wise, yes, avoid an initial number.


What do you mean by human-readable here? For large technical systems it's
simply not feasible to encode meaning into the URI and I might even
consider it an anti-pattern.

There are some community efforts that have introduced single letters and
number sequences for vocabulary development like CIDOC CRM which was

later

also adopted by community projects like wikidata. But instance data
typically doesn't have that requirement and can be random but has to be
syntax compliant of course.

I am sure Andy can elaborate on the details of the encoding here.


There's an online IRI validator.

https://sparql.org/iri-validator.html

using the jena-iri package.






Re: Querying URL with square brackets

2023-11-25 Thread Andy Seaborne




On 24/11/2023 10:05, Marco Neumann wrote:

(side note) preferably the local name of a URI should not start with a
number but a letter or underscore.


It's a hangover from XML QNames.

Turtle doesn't care.

Style-wise, yes, avoid an initial number.


What do you mean by human-readable here? For large technical systems it's
simply not feasible to encode meaning into the URI and I might even
consider it an anti-pattern.

There are some community efforts that have introduced single letters and
number sequences for vocabulary development like CIDOC CRM which was later
also adopted by community projects like wikidata. But instance data
typically doesn't have that requirement and can be random but has to be
syntax compliant of course.

I am sure Andy can elaborate on the details of the encoding here.


There's an online IRI validator.

https://sparql.org/iri-validator.html

using the jena-iri package.


Re: Querying URL with square brackets

2023-11-25 Thread Andy Seaborne




On 24/11/2023 08:55, Marco Neumann wrote:

Laura, see jena issue #2102
https://github.com/apache/jena/issues/2102


It's specific to [].

Because data formats accept these bad URIs (with a warning), the fact 
SPARQL generates errors is a bug to be fixed.


Andy



Marco

On Fri, Nov 24, 2023 at 7:12 AM Laura Morales  wrote:


I have a few URLs containing square brackets like
http://example.org/foo[1]bar
I can create a TDB2 dataset without much problems, with warnings


Warnings exist for a reason!

>> but no errors.



I tried escaping, "foo\[1\]bar" but it doesn't work.


URIs don't accept \ escapes.

And U+ doesn't help because the check isn't just in the parser.



Re: Querying URL with square brackets

2023-11-25 Thread Andy Seaborne




On 24/11/2023 10:40, Marco Neumann wrote:

The URI syntax is defined by the Internet Engineering Task Force (IETF) in
RFC 3986.

W3C RDF is just a rule-taker here ;)

https://datatracker.ietf.org/doc/html/rfc3986


We've drafted a non-normative section:

https://www.w3.org/TR/rdf12-concepts/#iri-abnf

which pulls together all the RFCs we could find, adopting the current 
state of terminology.


Nowadays, URI and IRI are interchangeable. Only use in HTTP requests 
worries about ASCII vs UTF-8 and then only in old software. Use a 
toolkit and it'll sort it out.


Only the URI scheme name is restricted to A-Z.

   Andy



Marco

On Fri, Nov 24, 2023 at 10:36 AM Laura Morales  wrote:


What do you mean by human-readable here? For large technical systems it's
simply not feasible to encode meaning into the URI and I might even
consider it an anti-pattern.


This is my problem. I do NOT want to encode any meaning into URLs, but I
do want them to be human readable simply because I) properties are URLs
too, 2) they can be used online, and 3) they are simpler to work with, for
example editing in a Turtle file or writing a query.

:alice :knows :bob    vs    :dsa7hdsahdsa782j :d93ifg75jgueeywu :s93oeirugj290sjf

I can avoid [ entirely, but it rises the question of what other characters
I MUST avoid.


{} {}

You can use () but hierarchical names are better.

Be careful about ':' because it can't be in the first segment of a path 
of a relative URI (it looks like a scheme name).


Andy








RDF URI references [Was: Querying URL with square brackets]

2023-11-25 Thread Andy Seaborne
Another option is the HTTP query string - think of it as asking a 
question of resource "http://example.org/book".


Andy

On 24/11/2023 11:03, Martynas Jusevičius wrote:

On Fri, Nov 24, 2023 at 11:46 AM Laura Morales  wrote:



in the case that I want to use these URLs with a web browser.


I don't understand what the trouble with the above example is?


The problem with # is that browsers treat them as the start of a local 
reference. When you open http://example.org/book#1 the server only receives 
http://example.org/book. In other words it would be an error to create nodes 
for n different books (#1 #2 #3 #n) if my goal is also to use these URLs with a 
browser (for example if I want to show one page for every book). It's not a 
problem with Jena, it's a problem with the way browsers treat the fragment.


If you want a page for every book, don't use fragment URIs. Use
http://example.org/book/1 or http://example.org/book/1#this instead of
  http://example.org/book#1.


Re: Implicit default-graph-uri

2023-11-19 Thread Andy Seaborne




On 18/11/2023 08:21, Laura Morales wrote:

I've tried this option too using the following configuration


fuseki:dataset [
 a ja:RDFDataset;

 ja:defaultGraph [
 a ja:UnionModel ;

 ja:subModel [
 a tdb2:GraphTDB2 ;
 tdb2:dataset [
 a tdb2:DatasetTDB2 ;
 tdb2:location "location1"
 ]
 ] ;

 ja:subModel [
 a tdb2:GraphTDB2 ;
 tdb2:dataset [
 a tdb2:DatasetTDB2 ;
 tdb2:location "location2"
 ]
 ] ;
 ]
]


but it always gives me "transaction error" with any query. I've tried TDB 1 
instead, but it gives me a different error:

ERROR Server  :: Exception in initialization: the (group) Assembler 
org.apache.jena.assembler.assemblers.AssemblerGroup$PlainAssemblerGroup@b73433 
cannot construct the object [...] [ja:subModel of [...] [ja:defaultGraph of 
[...] ]] because it does not have an implementation for the objects's most 
specific type ja:Model

I've found a couple of old threads online with people reporting "MultiUnion" as 
working, but I don't know how to use this configuration. I couldn't find it on the Fuseki 
documentation and simply replacing ja:UnionModel for ja:MultiUnionModel doesn't make any 
difference for me.
Do you know anything about this MultiUnion and if it could work?


Only with use of default-graph-uri or SELECT FROM.

Having a dataset description in the request itself causes the processor 
to have a per-request dataset with a java class GraphUnionRead as the 
default graph. GraphUnionRead copes with the transaction setup across 
the two locations.


As things stand at the moment, other ways of constructing a suitable 
dataset don't use GraphUnionRead.


Using a service name of

   "/service/query/?default-graph-uri=urn:x-arq:UnionGraph".

Tools generally cope with a query string in the URL and correctly assemble 
the URL:


Java:

QueryExecution qExec =
    QueryExecutionHTTP
        .service("http://localhost:3030/ds/?default-graph-uri=urn:x-arq:UnionGraph")
        .query("SELECT * { ?s ?p ?o }")
        .build();

or
 curl -d 'query=SELECT * {?s ?p ?o}'
   'http://localhost:3030/ds/?default-graph-uri=urn:x-arq:UnionGraph'

Andy




Re: Implicit default-graph-uri

2023-11-17 Thread Andy Seaborne




On 16/11/2023 11:35, Laura Morales wrote:

I would like to configure Fuseki such that I can use 2 datasets from 2 
different locations, as if they were a single dataset.
This is my config.ttl:


<#> a fuseki:Service ;

 fuseki:endpoint [
 fuseki:operation fuseki:query
 ] ;

 fuseki:dataset [
 a ja:RDFDataset ;

 ja:namedGraph [
 ja:graphName :graph1 ;
 ja:graph [
 a tdb2:GraphTDB ;
 tdb2:location "location-1" ;
 ]
 ] ;

 ja:namedGraph [
 ja:graphName :graph2 ;
 ja:graph [
 a tdb2:GraphTDB ;
 tdb2:location "location-2" ;
 ]
 ] ;
 ] .


There is no particular reason why I used this configuration; I mostly copied it 
from the Fuseki documentation. If it can be simplified, please suggest how.

I query Fuseki with "/service/query/?default-graph-uri=urn:x-arq:UnionGraph". I also know that I can use 
"SELECT FROM ". But I would like to know if I can configure this behavior as 
the default in the main configuration file, such that I can avoid using "x-arq:UnionGraph" entirely.
Both datasets are TDB2 and contain triples only in the default unnamed graph 
(in other words do not contain any named graph inside).


I can't find a way to do that.

tdb2:unionDefaultGraph applies to a single dataset and you have two 
datasets.


Using
  ja:defaultGraph [
a ja:Model;
ja:subModel ...
ja:subModel ...
] ;

falls foul of transaction coordination across two different models (even 
if they are views of the same database).


I thought that would work - there is some attempt to extend transactions 
into graphs but this seems to be pushing things too far.


Andy


Re: Delete Dataset with fuseki

2023-11-16 Thread Andy Seaborne




On 15/11/2023 09:19, Steven Blanchard wrote:

Dear Jena Users,

When i delete a dataset with fuseki, only the configuration file are 
removing and not the tdb2 folder
According to the documentation this is expected behaviour:

But I have a problem with this: when I want to recreate a dataset with 
the same name, the old data is still there.


How can I delete the tdb2 dataset with fuseki by interface or API?


Currently, that isn't possible.

You can delete the folder via the OS.

Feel free to raise an issue for a new feature. We can add a query string 
item and do the same as compact, which has a ?deleteOld flag.
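
For reference, a sketch of calling that existing compact operation with the
flag from plain Java - the server URL and the dataset name "ds" are made up,
and the admin endpoints must be enabled (and may require authentication):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class CompactWithDeleteOld {
        public static void main(String[] args) throws Exception {
            // POST /$/compact/{name}?deleteOld=true : compact and remove the old generation
            HttpRequest req = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:3030/$/compact/ds?deleteOld=true"))
                    .POST(HttpRequest.BodyPublishers.noBody())
                    .build();
            HttpResponse<String> resp = HttpClient.newHttpClient()
                    .send(req, HttpResponse.BodyHandlers.ofString());
            System.out.println(resp.statusCode() + " " + resp.body());
        }
    }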


However, there is also the issue in the general case that other 
operations may be using the database concurrently. Maybe renaming it 
aside is better - already-started requests will finish cleanly.


Note that on MS Windows, it isn't possible to free the space. It is a 
JVM feature on MS Windows that memory mapped files do not go away until 
the JVM exits. This is a long-standing issue in Java.


Andy



Thanks,

Steven




Re: Issues importing Jena to Eclipse after clean install

2023-11-16 Thread Andy Seaborne

https://stackoverflow.com/questions/77490993/importing-jena-to-eclipse-compile-problems

On 15/11/2023 21:35, Paul Jarski wrote:


I believe I've followed the instructions from 
https://jena.apache.org/tutorials/using_jena_with_eclipse.html: I ran 
mvn clean install with apparently no issues, but then when I tried to 
import the maven project to Eclipse, I had 271 errors, mostly 
pertaining to the Graph type, which didn't compile somehow. I can't 
seem to find advice on how to resolve this issue anywhere. Any 
suggestions? Thanks in advance!


Screenshot from 2023-11-15 13-25-54.png



Re: Semantics of SPARQL Update Delete

2023-11-10 Thread Andy Seaborne




On 10/11/2023 20:35, Marco Neumann wrote:

On Fri, Nov 10, 2023 at 5:51 PM Andy Seaborne  wrote:




On 10/11/2023 12:33, Marco Neumann wrote:

Should DELETE {URI URI * } not update all matching graph patterns?


No.
(and that's bad syntax)


I had a case where only DELETE {URI URI NODE } did execute the update in
the dataset/graph/query fuseki UI.

To be precise it is a DELETE INSERT combination with an empty WHERE

clause.


DELETE {pattern} INSERT{pattern} WHERE{ }


the "pattern" is used as a template.
DELETE {template} INSERT {template} WHERE {pattern}

If the template has variables, these variables must be set by the WHERE
clause. Otherwise triple patterns with unbound variables are skipped.



OK, yes I think this is my case, an unbound variable was used in the
template, the "Update Success" tricked me into believing that the data was
actually removed.


"Update Success" means "executed as per spec" :-)

It's the same rule as CONSTRUCT which skips triples with any unbound 
variables.
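
A small sketch of the difference, on made-up data, using the convenience
class UpdateAction against an in-memory model:

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.update.UpdateAction;

    public class DeleteTemplateDemo {
        public static void main(String[] args) {
            Model m = ModelFactory.createDefaultModel();
            m.createResource("http://example/s")
             .addProperty(m.createProperty("http://example/p"), "v");

            // Unbound ?x in the template and an empty WHERE: the triple pattern
            // is skipped, so the update "succeeds" but deletes nothing.
            UpdateAction.parseExecute(
                "DELETE { <http://example/s> <http://example/p> ?x } WHERE { }", m);
            System.out.println(m.size());   // still 1

            // Short form: the pattern is both the template and the WHERE clause.
            UpdateAction.parseExecute(
                "DELETE WHERE { <http://example/s> <http://example/p> ?x }", m);
            System.out.println(m.size());   // 0
        }
    }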


Andy



There is no pattern matching  in a template.

There is a short form DELETE WHERE { pattern } which is
DELETE { pattern } WHERE {pattern}, using the pattern as the template.

  Andy



Marco








Re: Semantics of SPARQL Update Delete

2023-11-10 Thread Andy Seaborne




On 10/11/2023 18:19, Marco Neumann wrote:

On Fri, Nov 10, 2023 at 5:51 PM Andy Seaborne  wrote:




On 10/11/2023 12:33, Marco Neumann wrote:

Should DELETE {URI URI * } not update all matching graph patterns?


No.
(and that's bad syntax)



DELETE {  ?x } is bad syntax?


"*" is bad syntax.

DELETE {  ?x } is bad syntax for another reason - there must 
be a WHERE.






I had a case where only DELETE {URI URI NODE } did execute the update in
the dataset/graph/query fuseki UI.

To be precise it is a DELETE INSERT combination with an empty WHERE

clause.


DELETE {pattern} INSERT{pattern} WHERE{ }


the "pattern" is used as a template.
DELETE {template} INSERT {template} WHERE {pattern}

If the template has variables, these variables must be set by the WHERE
clause. Otherwise triple patterns with unbound variables are skipped.

There is no pattern matching  in a template.

There is a short form DELETE WHERE { pattern } which is
DELETE { pattern } WHERE {pattern}, using the pattern as the template.

  Andy



Marco








Re: Semantics of SPARQL Update Delete

2023-11-10 Thread Andy Seaborne




On 10/11/2023 12:33, Marco Neumann wrote:

Should DELETE {URI URI * } not update all matching graph patterns?


No.
(and that's bad syntax)


I had a case where only DELETE {URI URI NODE } did execute the update in
the dataset/graph/query fuseki UI.

To be precise it is a DELETE INSERT combination with an empty WHERE clause.

DELETE {pattern} INSERT{pattern} WHERE{ }


the "pattern" is used as a template.
DELETE {template} INSERT {template} WHERE {pattern}

If the template has variables, these variables must be set by the WHERE 
clause. Otherwise triple patterns with unbound variables are skipped.


There is no pattern matching  in a template.

There is a short form DELETE WHERE { pattern } which is
DELETE { pattern } WHERE {pattern}, using the pattern as the template.

Andy



Marco



Re: Ever-increasing memory usage in Fuseki

2023-11-02 Thread Andy Seaborne

Hi Hugo,

On 01/11/2023 19:43, Hugo Mills wrote:

Hi,

We’ve got an application we’ve inherited recently which uses a Fuseki 
database. It was originally Fuseki 3.4.0, and has been upgraded to 4.9.0 
recently. The 3.4.0 server needed regular restarts (once a day) in order 
to keep working; the 4.9.0 server is even more unreliable, and has been 
running out of memory and being OOM-killed multiple times a day. This 
afternoon, it crashed enough times, fast enough, to make Kubernetes go 
into a back-off loop, and brought the app down for some time.


We’re using OpenJDK 19. The JVM options are: “-Xmx:30g -Xms18g”, and the 
container we’re running it in has a memory limit of 31 GiB.


Setting Xmx close to the container limit can cause problems.

The JVM itself takes space and the operating system needs space.
The JVM itself has a ~1G extra space for direct memory which networking 
uses.


The Java heap will almost certainly grow to reach Xmx at some point 
because Java delays running full garbage collections. The occasional 
drops you see are likely incremental garbage collections happening.


If Xmx is very close to the container limit, the heap will naturally grow 
(it does not know about the container limit), then the total in-use 
memory for the machine is reached and the container is killed.


30G heap looks like a very tight setting. Is there anything customized 
running in Fuseki? is the server dedicated to Fuseki?


As Conal mentioned, TDB used memory mapped files - these are not part of 
the heap. They are part of the OS virtual memory.


Is this a single database?
One TDB database needs about 4G RAM of heap space. Try a setting of -Xmx4G.

Only if you have a high proportion of very large literals will that 
setting not work.


More is not better from TDB's point of view. Space for memory mapped 
files is handled elsewhere, and that space will expand and contract 
as needed. If that space is squeezed out, the system will slow down.


We tried the 
“-XX:+UseSerialGC” option this evening, but it didn’t seem to help 
much. We see the RAM usage of the java process rising steadily as 
queries are made, with occasional small, but insufficient, drops.



The store is somewhere around 20M triples in size.


Is this a TDB database or in-memory? (I'm guessing TDB but could you 
confirm that.)


Query processing can lead to a lot of memory use if the queries are 
inefficient and there is a high, overlapping query load.


What is the query load on the server? Are there many overlapping requests?

Could anyone suggest any tweaks or options we could do to make this more 
stable, and not leak memory? We’ve downgraded to 3.4.0 again, and it’s 
not running out of space every few minutes at least, but it still has an 
ever-growing memory usage.


Thanks,

Hugo.

*Dr. Hugo Mills*

Senior Data Scientist

hugo.mi...@agrimetrics.co.uk 


[ANN] Apache Jena 4.10.0

2023-11-01 Thread Andy Seaborne



The Apache Jena development community is pleased to
announce the release of Apache Jena 4.10.0

In this release:

* Prepare for Jena5

  Check use of deprecated API calls
These are largely being removed in Jena5.

  Jena5 will require Java17

  jena5 Fuseki will switch from javax.servlet to jakarta.servlet
This will require use of Apache Tomcat 10 to run the WAR file.

  jena-jdbc is planned for retirement in Jena 5.0.0

See the Jena5 label in the github issues area:

https://github.com/apache/jena/issues?q=is%3Aissue+label%3Ajena5

* Development will switch to Jena5.
  The 'main' branch is now for Jena5 development.
  There is a branch 'jena4' marking the 4.10.0 release

== Notes

All issues: https://s.apache.org/jena-4.10.0-issues

There is a CHANGES.txt in the root of the repository
with the history of announcement messages.

 Contributions:

Shawn Smith
"Race condition with QueryEngineRegistry and
UpdateEngineRegistry init()"
  https://issues.apache.org/jira/browse/JENA-2356

Ali Ariff
"Labeling for Blank Nodes Across Writers"
  https://github.com/apache/jena/issues/1997

sszuev
"jena-core: add more javadocs about Graph-mem thread-safety and 
ConcurrentModificationException"

  https://github.com/apache/jena/pull/1994

sszuev
GH-1419: fix DatasetGraphMap#clear
  https://github.com/apache/jena/issues/1419

sszuev
GH-1374: add copyWithRegisties Context helper method
  https://github.com/apache/jena/issues/1374


All issues in this release:
https://s.apache.org/jena-4.10.0-issues

 Key upgrades

org.apache.lucene : 9.5.0 -> 9.7.0
org.apache.commons:commons-lang3: 3.12.0 -> 3.13.0
org.apache.sis.core:sis-referencing : 1.1 -> 1.4

== Obtaining Apache Jena 4.10.0

* Via central.maven.org

The main jars and their dependencies can used with:

  
  <dependency>
    <groupId>org.apache.jena</groupId>
    <artifactId>apache-jena-libs</artifactId>
    <type>pom</type>
    <version>4.10.0</version>
  </dependency>

Full details of all maven artifacts are described at:

http://jena.apache.org/download/maven.html

* As binary downloads

Apache Jena libraries are available as a binary distribution of
libraries. For details of a global mirror copy of Jena binaries please see:

http://jena.apache.org/download/

* Source code for the release

The signed source code of this release is available at:

http://www.apache.org/dist/jena/source/

and the signed master source for all Apache Jena releases is available
at: http://archive.apache.org/dist/jena/

== Contributing

If you would like to help out, a good place to look is the list of
open issues at:

https://github.com/apache/jena/issues

or review pull requests at

https://github.com/apache/jena/pulls

or drop into the dev@ list.

We use github pull requests and other ways for accepting code:
 https://github.com/apache/jena/blob/master/CONTRIBUTING.md


Re: HTTP QueryExecution has been closed

2023-10-29 Thread Andy Seaborne

It's not clear from the information so far.

Complete, minimal, verifiable example please.

Also - what's the stacktrace you are seeing and which Jena version are 
you running?


Andy

On 27/10/2023 22:18, Martynas Jusevičius wrote:

Hi,

I'm trying to understand in which circumstances can the following code

 try (QueryExecution qex = QueryExecution.create(getQuery(), rowModel))
 {
 return qex.execConstructDataset();
 }

throw the "HTTP QueryExecution has been closed" exception?
Full code here:
https://github.com/AtomGraph/LinkedDataHub/blob/rf-direct-graph-ids-only/src/main/java/com/atomgraph/linkeddatahub/imports/stream/csv/CSVGraphStoreRowProcessor.java#L141

The execution is not even happening over HTTP? Is it somehow closed prematurely?

I can see the exception being thrown in QueryExecDataset::constructQuads:
https://github.com/apache/jena/blob/main/jena-arq/src/main/java/org/apache/jena/sparql/exec/QueryExecDataset.java#L211

Martynas


Re: qparse but preserve comments

2023-10-29 Thread Andy Seaborne




On 27/10/2023 17:58, Justin wrote:

Hello,

Is it possible to allow qparse to preserve comments?


No - the parser skips them.



e.g. it currently does not:
```
justin@parens:/tmp$ cat a.rq
select * where {
?s ?p ?o
# comment here
}
justin@parens:/tmp$ ~/Downloads/apache-jena-4.7.0/bin/qparse --query a.rq
SELECT  *
WHERE
   { ?s  ?p  ?o }
```

If comment lines (starting with #) are too tricky to work with what about
something like Clojure's `comment`
e.g.

```
(comment (range 5))
```
That way the "comment" is still part of the AST tree.
Maybe there could be a magic predicate that gets ignored?
```
[] ex:comment "comment here" .
```



If the comment is between syntax elements, then, yes, an 
"ElementComment" could be added.


Some positions are within syntax elements:

 { ?s
   # Some comment
   ?p  ?o }

and don't have a natural place to go in the AST.

Andy

c.f. Pragmas.


Re: How to reconstruct a Literal from a SPARQL SELECT row element?

2023-10-26 Thread Andy Seaborne




On 26/10/2023 10:17, Steve Vestal wrote:
What is the best way to reconstruct a typed Literal from a SPARQL SELECT 
result?


I have a SPARQL SELECT query issued against an OntModel in this way:

  QueryExecution structureRowsExec = 
QueryExecutionFactory.create(structureRowsQuery, owlOntModel);


Here are some example triples in the query:

   ?a2 
 ?dataVar1.
   ?a2 
 ?dataVar2.




Query results come back as the right RDF term kind.


The OntModel being queried was created using typed literals, e.g.,


     DataPropertyAssertion( struct:floatProperty struct:indivA2 
"123.456"^^xsd:float )
     DataPropertyAssertion( struct:dateTimeProperty struct:indivA2 
"2023-10-06T12:05:10Z"^^xsd:dateTime )


When I look at the ?dataVar1 and ?dataVar2 results in a row, I get 
things like:


  1
  stringB
  123.456
  2023-10-06T12:05:10Z


Are those just the toString() presentation?
Or is your query returning strings?



What is a good way to reconstruct a typed Literal from the query 
results? 


RDFNode is the class for all RDF term types.

QuerySolution.get

and if you know they are literals:

QuerySolution.getLiteral
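
A small sketch, with made-up data and property names, of pulling the typed
literal (rather than its string rendering) out of each row:

    import org.apache.jena.query.*;
    import org.apache.jena.rdf.model.*;

    public class TypedLiteralFromSelect {
        public static void main(String[] args) {
            Model m = ModelFactory.createDefaultModel();
            m.createResource("http://example/indivA2")
             .addProperty(m.createProperty("http://example/floatProperty"),
                          m.createTypedLiteral(123.456f));

            String q = "SELECT ?dataVar1 WHERE { ?s <http://example/floatProperty> ?dataVar1 }";
            try (QueryExecution qExec = QueryExecutionFactory.create(q, m)) {
                ResultSet rs = qExec.execSelect();
                while (rs.hasNext()) {
                    QuerySolution row = rs.next();
                    Literal lit = row.getLiteral("dataVar1");  // the typed Literal, not toString()
                    System.out.println(lit.getLexicalForm());  // "123.456"
                    System.out.println(lit.getDatatypeURI());  // ...XMLSchema#float
                    System.out.println(lit.getFloat());        // the value as a float
                }
            }
        }
    }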

Is there a SPARQL option to show full typed literal strings? 
Something that can be added to the query?  A utility method that can 
identify the XSD schema simple data type when given a result value string?





Re: pellet version

2023-10-19 Thread Andy Seaborne




On 19/10/2023 14:46, Taras Petrenko wrote:

Hi,
I would like to know which Pellet implementation is the most consistent with 
Jena? or which one is currently used in Protege ?
Now I am using the openllet-jena, version 2.6.3:


To find the version of Jena to go with a release of openllet-jena, 
either check through the dependencies of your project or look in the POM 
for openllet-parent.


There are later versions of openllet-jena

https://repo1.maven.org/maven2/com/github/galigator/openllet/openllet-jena/

The version in the git repo is

2.6.6-SNAPSHOT

uses Jena 4.2.0

https://github.com/Galigator/openllet/blob/3abccbfc0eec54233590cd4149055b78351e374d/pom.xml#L88

so you could try building from source.

Andy





com.github.galigator.openllet
openllet-jena
2.6.3


But I noticed some Datatype conversion problems in there..

Thank you for your time and all the best

Taras



Dr. Taras Petrenko
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19, 70569 Stuttgart, Germany
Email: taras.petre...@hlrs.de



Re: In using RIOT I encounter the "64000" entity expansions error.

2023-10-13 Thread Andy Seaborne


On 12/10/2023 20:20, Steve Vestal wrote:
I couldn't resist trying https://purl.obolibrary.org/obo/foodon.owl as 
a stress test for what we are doing.  We're on Jena 4.5.0 and I'm getting


Not in RDF/XML format due to exception 
org.apache.jena.riot.RiotException [line: 110334, col: 72] Invalid 
byte 2 of 2-byte UTF-8 sequence.

("Not in RDF/XML format due to..." does not appear to be a Jena message)

At that location:

"...(/ˈærɪkə/ or /əˈriːkə/)..."
        ^
(This email is UTF-8)

Line/column for encoding problems aren't always right but it looks like 
it is here.


Works for me in 3.17.0, 4.5.0, 5.0.0-dev

JVM_ARGS="-DentityExpansionLimit=200" riot --validate --count foodon.owl
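
If the parse is run from inside a Java program (e.g. from Eclipse) rather
than via the riot script, the same limit can be raised with a system
property - a sketch, with an illustrative value and file name; the property
may need to be set before the first XML parse in the JVM:

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.riot.RDFDataMgr;

    public class ParseLargeRdfXml {
        public static void main(String[] args) {
            System.setProperty("entityExpansionLimit", "2000000");  // illustrative value
            Model m = RDFDataMgr.loadModel("foodon.owl");           // *.owl is read as RDF/XML
            System.out.println("Triples: " + m.size());
        }
    }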


Could this be due to my Jena version or Eclipse or Windows or UTF-8?


Windows most likely.
It can happen if the data has been piped at the command line.

    Andy



On 10/12/2023 1:42 PM, Andy Seaborne wrote:

Thanks. It parses OK.

On Thu, 12 Oct 2023, 19:36 Jim Balhoff,  wrote:


On Oct 6, 2023, at 3:46 AM, Andy Seaborne  wrote:


On 28/06/2023 09:26, Damion Dooley wrote:

I’m using RIOT to parse a large food ontology in owl rdf/xml format.

Damion,

Is that data publicly available?

There's a new RDF/XML parser for Jena in the pipeline and I'd like to

try it out on real data.

Andy,

Damion is active in FOODON, so that may be the ontology to try:
http://obofoundry.org/ontology/foodon.html

The ontology is at https://purl.obolibrary.org/obo/foodon.owl

- Jim





Re: In using RIOT I encounter the "64000" entity expansions error.

2023-10-12 Thread Andy Seaborne
Thanks. It parses OK.

On Thu, 12 Oct 2023, 19:36 Jim Balhoff,  wrote:

> > On Oct 6, 2023, at 3:46 AM, Andy Seaborne  wrote:
> >
> >
> > On 28/06/2023 09:26, Damion Dooley wrote:
> >> I’m using RIOT to parse a large food ontology in owl rdf/xml format.
> >
> > Damion,
> >
> > Is that data publicly available?
> >
> > There's a new RDF/XML parser for Jena in the pipeline and I'd like to
> try it out on real data.
>
> Andy,
>
> Damion is active in FOODON, so that may be the ontology to try:
> http://obofoundry.org/ontology/foodon.html
>
> The ontology is at https://purl.obolibrary.org/obo/foodon.owl
>
> - Jim
>
>
>


Re: In using RIOT I encounter the "64000" entity expansions error.

2023-10-06 Thread Andy Seaborne



On 28/06/2023 09:26, Damion Dooley wrote:

I’m using RIOT to parse a large food ontology in owl rdf/xml format.


Damion,

Is that data publicly available?

There's a new RDF/XML parser for Jena in the pipeline and I'd like to 
try it out on real data.


Andy


Re: Encountering the error "org.apache.thrift.protocol.TProtocolException: Unrecognized type 0" in different scenarios

2023-09-28 Thread Andy Seaborne




On 28/09/2023 15:49, Jan Eerdekens wrote:

I have been looking into reproducing the error locally, but haven't been
able to as the LOAD commands that produced the error a couple of months ago
now kill my Rancher. With a lot of restarts and Rancher configuration
changes (Apple virtualization instead of QEMU and virtiofs volume mounts
instead of the default one) I was able to get the LOADs working again. This
was with Jena 4.9.0 and now the LOADs didn't produce the "unrecognized type
0" error anymore... and I was even able to issue more and bigger LOAD
commands than before in 4.7.0.


Good news!


So after getting that working, but not being able to successfully reproduce
the error, I decided to try it in Jena 4.7.0... but there again my Rancher
started failing when trying to do larger LOADs. So I wasn't able to
reproduce the "unrecognized type 0". We did however got it a bunch of times
on our TST environment in the last week (in a bunch of different
scenarios). So it definitely is still occurring and also for datasets that
were created in at least 4.8.0. Might it be a good idea to delete/recreate
all the datasets on the instance and see if it happens again?


Yes.

The error was likely caused silently at write time but only shows up at 
read time.



I also had a further chat with our OPS people to check if they have any
ideas about other processes that might be accessing Jena's files. The only
things we could come up with were:

- the EFS we're using uses encryption at rest
- we're not doing backups ourselves, but whatever EFS does for backup
related stuff is being used
- we're running a daily compact command to free up disk space.. but that is
an API call that we guess shouldn't be an issue?

So we're still at a bit of a loss how and why this is happening.

On Tue, 19 Sept 2023 at 23:05, Andy Seaborne  wrote:


Hi Jan,

Thanks for the update.

On 18/09/2023 19:49, Jan Eerdekens wrote:

Hi Andy,

Sorry for the late answer, but I was quite busy.

The database was as far as I can tell generated in version 4.7.0 and then
upgrades to 4.8.0 and 4.9.0 were done. Datasets were created (and some
deleted and created again) in all these versions.

The scenario that my colleague had currently isn't reproducible after he
deleted and created his dataset again. I'd have to retry the data loads

for

my load test scenario and see if that still triggers the issue (during

the

load tests many months ago that was a pretty simple scenario that always
ended in the error - but that definitely was done on version 4.7.0). I'll
try to execute that loading code again and see what happens and open a
Github issue if it is able to reliably produce the issue in 4.9.0.

We are running Jena in a k8s cluster on AWS and it uses EFS as a file
store.


In case it matters, EFS is not the fastest storage for a database.
Caching tends to hide this if the caches are holding enough of the
working set but the latency is quite high.


As far as I know we don't have anything configured ourselves that
would cause concurrent access, but I'll check with our OPS people to see

if

they can identify something on the OS level that might access the files

or

if they have setup a backup process. Currently we're only running 1 Jena
instance per environment.

regards,

Jan



On Wed, 30 Aug 2023 at 23:08, Andy Seaborne  wrote:


Hi Jan,

On 30/08/2023 14:58, Jan Eerdekens wrote:

Hi,

We've been evaluating and using Jena for about 1.5 years now, but are
recently running into a perplexing issue. In a lot of different

scenarios,

ways of using Jena, we are getting the exceptions like the one below:




Caused by: org.apache.thrift.protocol.TProtocolException: Unrecognized

type

0
at

org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:140)

~[fuseki-server.jar:4.8.0]



The different scenarios where it has happened are:

 - LOADing data into a dataset
 - compacting a dataset
 - querying a dataset

In all those case we've run into trouble and get an exception that
mentions *org.apache.jena.tdb2.TDBException:
NodeTableTRDF/Read* and *org.apache.thrift.protocol.TProtocolException:
Unrecognized type 0*.

What can cause this? This looks kinda similar to this mailing list
question,

https://www.mail-archive.com/users@jena.apache.org/msg20409.html,

where it seems data corruption is mentioned that potentially isn't
recoverable?

   >

The first time I encountered this issue was while doing a bunch of
sequential LOAD commands to prepare a large dataset for load testing. I
used files of around 50mb (started off with bigger ones) and after

about

20

to 25 LOADs it would get this error (also the completion time of a LOAD
would go up and up). So for this scenario I was running locally (Jena
Fuseki running in docker/Rancher) and only running the LOADs and not

much

 else except for a SELECT here and there (via the Fuseki UI) to check that
performance while LOADing. Is there a way that that could cause data
corruption and the exception we're seeing?

Normalizing language tags

2023-09-28 Thread Andy Seaborne
An active issue in the RDF 1.2 Working Group is whether to mandate the 
syntactic form of language tags.


Currently, in RDF, it says that language tags are compared case 
insensitively and also that "Lexical representations of language tags 
MAY be converted to lower case." It's actually in RDF semantics as 
D-entailment.


The issue has come to prominence because of work on RDF canonicalization 
and hashing (RCH) which works on the syntax of graphs. Signing and 
Verifiable Credentials then rely on RCH. So syntax matters, not the value.


The language tags RFC 5646 (AKA BCP-47 which is a soft link to the 
current latest RFC on the subject) says that

"case distinctions do not carry meaning in language tags"

Canonicalization of Language Tags [2] is different and out of scope - it 
means use the preferred names, for example, for countries. That requires 
access to the global registry. It is not being considered by the RDF 1.2 WG.


A complication is that the RDF-defined preferred presentation of 
language tags is not the same as the RFC.  RDF says "lower case".


In the RFC, each subtag has a preferred form. It's "en-US" ,"en-Latn-US" 
... The preferred form normalization rules only need the language tag 
string. Different subtags are identified by length, or for the country 
part - by being first.


Jena defaults to treating language tags as given.
"abc"@fr and "abc"@FR are different RDF terms.

The Jena parsers have options to choose what to do.

  RDFParserBuilder.langTagLowerCase()
  RDFParserBuilder.langTagCanonical()
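
A sketch of using one of those options (the Turtle snippet is made up; the
builder method names are as listed above):

    import org.apache.jena.graph.Graph;
    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFParser;
    import org.apache.jena.sparql.graph.GraphFactory;

    public class LangTagOptions {
        public static void main(String[] args) {
            String ttl = "<http://example/s> <http://example/p> \"abc\"@EN-us .";
            Graph g = GraphFactory.createDefaultGraph();
            RDFParser.create()
                     .fromString(ttl)
                     .lang(Lang.TTL)
                     .langTagCanonical()     // or .langTagLowerCase()
                     .parse(g);
            // With langTagCanonical the stored term should be "abc"@en-US;
            // with langTagLowerCase it would be "abc"@en-us.
            g.find().forEachRemaining(System.out::println);
        }
    }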


For Jena5:

1/ Do you think Jena should switch to one form?
   1a/ Should that be in the parsers, setting the default to one form of output?
   1b/ Or should all langtags get normalized as the node is created?

2/ Which is your preferred form for RDF 1.2?
   2a/ Lower case
   2b/ RFC-preferred form
   2c/ No change


If Jena changes to have a common format of language tags, persistent 
data that has language tags in it will have to be reloaded.


Jena has a LangTag parser, org.apache.jena.riot.web.LangTag.
The current codebase defers to the JDK's Locale.Builder for normalization, 
but that can be intercepted, without application involvement, if the JDK 
is insufficient.


Andy


RFC 5646
https://datatracker.ietf.org/doc/html/rfc5646

[1] Formatting of language tags:
https://datatracker.ietf.org/doc/html/rfc5646#section-2.1.1

[2] Canonicalization of language tags
https://datatracker.ietf.org/doc/html/rfc5646#page-66

[3]
https://issues.apache.org/jira/browse/JENA-1384


Re: Literal term equality - canonicalise / normalize

2023-09-28 Thread Andy Seaborne




On 25/09/2023 15:35, Arne Bernhardt wrote:

Hello,
in order to use the GraphMem2 graphs in Jena 4.9, we are planning to switch
to "literal term equality" in our projects.

Currently we are discussing the following two approaches:

1. simple RDF standard compatibility.
We treat object literal nodes like any other node. The term representation
is always preserved, and users of our API only need to know the RDF
standards.
Anyone inserting "true"^^boolean needs to know that this is not the same
(term) as "1"^^boolean.

2. uniform value representations
All incoming data is canonicalised / normalised.
Users of our API just need to know that if they enter "1"^^boolean, they
will get back "true"^^boolean.


Users don't often realise "1"^^xsd:boolean is legal. Ditto for canonical 
integers.


From what I have seen, it is unusual for users to write these 
non-canonical forms, even for integers.



Should or can we use some of the classes in the jena project for this
purpose?
(like org.apache.jena.riot.process.normalize.CanonicalizeLiteral,
*.NormalizeValue and/or *.NormalizeValue2)


That would work. There are StreamRDF ways to apply the transformation.

The parser framework has RDFParserBuilder.canonicalLiterals(true).


Do you have any opinion on the two approaches?


Just information for the general reader:
TDB, for other reasons, canonicalises XSD number, date/time and boolean 
literals.


Andy



Regards
   Arne



Re: Encountering the error "org.apache.thrift.protocol.TProtocolException: Unrecognized type 0" in different scenarios

2023-09-19 Thread Andy Seaborne

Hi Jan,

Thanks for the update.

On 18/09/2023 19:49, Jan Eerdekens wrote:

Hi Andy,

Sorry for the late answer, but I was quite busy.

The database was as far as I can tell generated in version 4.7.0 and then
upgrades to 4.8.0 and 4.9.0 were done. Datasets were created (and some
deleted and created again) in all these versions.

The scenario that my colleague had currently isn't reproducible after he
deleted and created his dataset again. I'd have to retry the data loads for
my load test scenario and see if that still triggers the issue (during the
load tests many months ago that was a pretty simple scenario that always
ended in the error - but that definitely was done on version 4.7.0). I'll
try to execute that loading code again and see what happens and open a
Github issue if it is able to reliably produce the issue in 4.9.0.

We are running Jena in a k8s cluster on AWS and it uses EFS as a file
store.


In case it matters, EFS is not the fastest storage for a database.
Caching tends to hide this if the caches are holding enough of the 
working set but the latency is quite high.



As far as I know we don't have anything configured ourselves that
would cause concurrent access, but I'll check with our OPS people to see if
they can identify something on the OS level that might access the files or
if they have setup a backup process. Currently we're only running 1 Jena
instance per environment.

regards,

Jan



On Wed, 30 Aug 2023 at 23:08, Andy Seaborne  wrote:


Hi Jan,

On 30/08/2023 14:58, Jan Eerdekens wrote:

Hi,

We've been evaluating and using Jena for about 1.5 years now, but are
recently running into a perplexing issue. In a lot of different

scenarios,

ways of using Jena, we are getting the exceptions like the one below:




Caused by: org.apache.thrift.protocol.TProtocolException: Unrecognized

type

0
at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:140)
~[fuseki-server.jar:4.8.0]



The different scenarios where it has happened are:

- LOADing data into a dataset
- compacting a dataset
- querying a dataset

In all those case we've run into trouble and get an exception that
mentions *org.apache.jena.tdb2.TDBException:
NodeTableTRDF/Read* and *org.apache.thrift.protocol.TProtocolException:
Unrecognized type 0*.

What can cause this? This looks kinda similar to this mailing list
question,

https://www.mail-archive.com/users@jena.apache.org/msg20409.html,

where it seems data corruption is mentioned that potentially isn't
recoverable?

  >

The first time I encountered this issue was while doing a bunch of
sequential LOAD commands to prepare a large dataset for load testing. I
used files of around 50mb (started off with bigger ones) and after about

20

to 25 LOADs it would get this error (also the completion time of a LOAD
would go up and up). So for this scenario I was running locally (Jena
Fuseki running in docker/Rancher) and only running the LOADs and not much
else except for a SELECT here and there (via the Fuseki UI) to check that
performance while LOADing. Is there a way that that could cause data
corruption and the exception we're seeing?


"Unrecognized type 0" has come up in a couple of cases.

It means the node table is corrupt but the problem was caused silently
at some point in the past. The "Unrecognized type 0" exception happens
some time later (not a few seconds - either after a restart or a long
time of usage that has churned the node cache - possibly many months).

There have been some fixes around compaction that addressed bugs in this
area. This has been the most common problem.

Was this database originally created before 4.8.0?

If not, do you have a fixed scenario so that the situation can be
recreated for 4.9.0? Please raise a github issue for it.

Another situation is if another OS process interferes with the files
(container OS or host OS). What operating system is the host machine?

While TDB2 endeavours to protect against multiple copies of TDB running
the same files, that is imperfect if it is two containers and the
database is on a mounted docker volume used by two containers.

One other report seemed to be a backup process was running over the
files. We didn't get to the root cause of that one.

  Andy



regards,

Jan Eerdekens







Re: read-only Fuseki TDB2

2023-09-19 Thread Andy Seaborne
While in TDB2, read-transactions never write, there is a case where a 
read-only setup needs to do writes.


It happens only at start-up and only the first time after the database 
is written during setup.


If there is a journal file with outstanding changes, these are completed 
before the database is passed to the application. It means there was an 
abnormal termination during the last write transaction after the commit 
point. The timing window for that is quite small.


A simple way to prepare a database is to do a query on it, as a 
separate process, after it's been written, e.g.

tdb2.tdbquery --loc DATABASE 'ASK{}'
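
The same idea from Java, as a sketch ("DATABASE" is the TDB2 location):
connect from a process that is allowed to write, run a trivial query, exit.

    import org.apache.jena.query.*;
    import org.apache.jena.tdb2.TDB2Factory;

    public class PrepareDatabase {
        public static void main(String[] args) {
            Dataset ds = TDB2Factory.connectDataset("DATABASE");  // replays any outstanding journal
            ds.begin(ReadWrite.READ);
            try (QueryExecution qExec = QueryExecutionFactory.create("ASK {}", ds)) {
                System.out.println(qExec.execAsk());
            } finally {
                ds.end();
            }
        }
    }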

TDB1 is different; read-transactions can do some completion actions. The 
journal is more heavily used.


Andy

On 18/09/2023 16:35, Jim Balhoff wrote:

On Sep 18, 2023, at 11:09 AM, Andy Seaborne  wrote:



On 18/09/2023 15:35, Jim Balhoff wrote:

Thanks, I think that’s basically what I’ve got. The only operation I have 
enabled is 'fuseki:query’. But Fuseki still complains if the filesystem is 
read-only.


The database is opened before the configuration is processed.

Also, there is only one "database" java object for each database location and 
something elsewhere may now or later open it for other operations.

This could be changed - file management is (should be!) centralized in the 
codebase.

   


Got it, thanks. It isn’t too big of a problem; it would just be convenient for 
some situations like scaling up multiple servers in Kubernetes.



Re: read-only Fuseki TDB2

2023-09-18 Thread Andy Seaborne




On 18/09/2023 15:35, Jim Balhoff wrote:

Thanks, I think that’s basically what I’ve got. The only operation I have 
enabled is 'fuseki:query’. But Fuseki still complains if the filesystem is 
read-only.


The database is opened before the configuration is processed.

Also, there is only one "database" java object for each database 
location and something elsewhere may now or later open it for other 
operations.


This could be changed - file management is (should be!) centralized in 
the codebase.


Andy






On Sep 18, 2023, at 10:03 AM, Martynas Jusevičius  
wrote:

This looks like the configuration that you need:
https://jena.apache.org/documentation/fuseki2/fuseki-configuration.html#read-only-service

On Mon, Sep 18, 2023 at 2:43 PM Jim Balhoff  wrote:


Hi,

Is it possible to run a Fuseki server using a read-only TDB2 directory? I’d 
like to run a query-only SPARQL endpoint, no updates. However I get an 
exception at startup if the filesystem is read-only. Does Fuseki need to 
acquire the lock even if updates are turned off?

Thank you,
Jim





Re: read-only Fuseki TDB2

2023-09-18 Thread Andy Seaborne

The TDB2 database is open for general use (i.e. write).

You can get read-only by restricting the operations on the dataset to 
only query and the read functions of the Graph Store Protocol.


Being SPARQL, query operations and write (update) operations are 
separated by syntax, and there are different parsers for each subset of 
the total SPARQL grammar.
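
If the server is embedded in Java (jena-fuseki-main) rather than run as the
standalone server, the same restriction can be expressed when registering
the dataset - a sketch; note that, as above, the underlying TDB2 database
itself is still opened for general use:

    import org.apache.jena.fuseki.main.FusekiServer;
    import org.apache.jena.query.Dataset;
    import org.apache.jena.tdb2.TDB2Factory;

    public class ReadOnlyEndpoint {
        public static void main(String[] args) {
            Dataset ds = TDB2Factory.connectDataset("DATABASE");
            FusekiServer server = FusekiServer.create()
                    .port(3030)
                    .add("/ds", ds, false)   // false = no update operations on this endpoint
                    .build();
            server.start();
        }
    }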


Andy

On 18/09/2023 13:40, Jim Balhoff wrote:

Hi,

Is it possible to run a Fuseki server using a read-only TDB2 directory? I’d 
like to run a query-only SPARQL endpoint, no updates. However I get an 
exception at startup if the filesystem is read-only. Does Fuseki need to 
acquire the lock even if updates are turned off?

Thank you,
Jim



Re: Jena hangs on deleted files

2023-09-09 Thread Andy Seaborne
This situation could be related to the other issues you've reported 
(corrupted node tables) if some other Linux process (not necessarily 
Java) is accessing the files.


A process holding them open will stop them becoming recyclable by the OS.

Andy

On 08/09/2023 13:09, Mikael Pesonen wrote:

Just on a command line (dev system)

/usr/bin/java -Xmx8G -jar fuseki-server.jar --update --port 3030 
--config=../jena_config/fuseki_config.ttl



On 08/09/2023 11.47, Andy Seaborne wrote:

In a container? As a VM?

On 08/09/2023 07:36, Mikael Pesonen wrote:

We are using Ubuntu.

On Thu, 7 Sept 2023 at 16:33, Andy Seaborne  wrote:


Are the database files on a MS Windows filesystem?

There is a long-standing Java issue that memory mapped files on MS
Windows do not get freed until the JVM exits.

Various bugs in the OpenJDK bug database such as:

https://bugs.openjdk.org/browse/JDK-4715154

  Andy

On 07/09/2023 13:06, Mikael Pesonen wrote:


We used deleteOld param. The 50 gigs are ghost files that are deleted
but not released, that's what I meant by hanging on deleted files.
Restarting jena releases them and now for example freed 50 gigs of 
space.


On 07/09/2023 15.02, Øyvind Gjesdal wrote:

What does the content of the tdb2 folder look like?

I think compact by default never deletes the old data, but you have
parameters for making it delete the old content on completion.

`--deleteOld` can be supplied to the tdb2.tdbcompact command line 
tool

and
`?deleteOld=true` can be supplied to the administration api when 
calling

compact


https://jena.apache.org/documentation/fuseki2/fuseki-server-protocol.html#compact


You can also delete  the Data- that isn't the latest one in the
database folder.

Best regards,
Øyvind

On Thu, Sep 7, 2023 at 1:33 PM Mikael Pesonen

wrote:

After a while 25 gigs of files on data folder becomes 80 gigs of 
disk
usage because Jena (4.6.1) doesn't release files. Same with 
compact. Is

this fixed in newer versions?











Re: Jena hangs on deleted files

2023-09-08 Thread Andy Seaborne

In a container? As a VM?

On 08/09/2023 07:36, Mikael Pesonen wrote:

We are using Ubuntu.

On Thu, 7 Sept 2023 at 16:33, Andy Seaborne  wrote:


Are the database files on a MS Windows filesystem?

There is a long-standing Java issue that memory mapped files on MS
Windows do not get freed until the JVM exits.

Various bugs in the OpenJDK bug database such as:

https://bugs.openjdk.org/browse/JDK-4715154

  Andy

On 07/09/2023 13:06, Mikael Pesonen wrote:


We used deleteOld param. The 50 gigs are ghost files that are deleted
but not released, that's what I meant by hanging on deleted files.
Restarting jena releases them and now for example freed 50 gigs of space.

On 07/09/2023 15.02, Øyvind Gjesdal wrote:

What does the content of the tdb2 folder look like?

I think compact by default never deletes the old data, but you have
parameters for making it delete the old content on completion.

`--deleteOld` can be supplied to the tdb2.tdbcompact command line tool
and
`?deleteOld=true` can be supplied to the administration api when calling
compact


https://jena.apache.org/documentation/fuseki2/fuseki-server-protocol.html#compact


You can also delete  the Data- that isn't the latest one in the
database folder.

Best regards,
Øyvind

On Thu, Sep 7, 2023 at 1:33 PM Mikael Pesonen

wrote:


After a while 25 gigs of files on data folder becomes 80 gigs of disk
usage because Jena (4.6.1) doesn't release files. Same with compact. Is
this fixed in newer versions?









Re: Use rdf:ID in RDF/XML generated file

2023-09-07 Thread Andy Seaborne

On 07/09/2023 15:54, mbk wrote:


Hi!

We generate an RDF/XML file which has all its resources with the 
'rdf:about' attribute. We would like to replace this attribute with 'rdf:ID'.


Using apache-jena 3.17.0 we create a resource with 
model.createResource(uri, res) where uri is a UUID with prefix '_' 
(_04f2f0d3-10f4-4248-a7fc-fc8243ec7250) and res is a resource from a 
vocabulary. The resource node in the generated file has the rdf:about attribute. 
We would like to have rdf:ID instead.


Hi,

Is that a URI of <_:04f2f0d3-10f4-4248-a7fc-fc8243ec7250> or 
<_04f2f0d3-10f4-4248-a7fc-fc8243ec7250> ?


The first is not a legal URI - the scheme name "_" isn't legal. It's not 
a blank node either because the argument string is interpreted as a URI.


The second is a relative URI which when used in RDF/XML will be resolved 
against the base URI.


You should use a full URI in a call to model.createResource.

rdf:ID (on nodes) will become an URI fragment and also be resolved.

rdf:ID="abc" is much the same as rdf:about="#abc".
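
A small sketch of that equivalence (made-up base URI and property), parsing
both forms against the same base and comparing the results:

    import java.io.StringReader;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;

    public class IdVersusAbout {
        public static void main(String[] args) {
            String head = "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\""
                        + " xmlns:ex=\"http://example/\">";
            String tail = "</rdf:RDF>";
            String withId    = head + "<rdf:Description rdf:ID=\"abc\"><ex:p>v</ex:p></rdf:Description>" + tail;
            String withAbout = head + "<rdf:Description rdf:about=\"#abc\"><ex:p>v</ex:p></rdf:Description>" + tail;
            String base = "http://example/doc";

            Model m1 = ModelFactory.createDefaultModel();
            m1.read(new StringReader(withId), base, "RDF/XML");
            Model m2 = ModelFactory.createDefaultModel();
            m2.read(new StringReader(withAbout), base, "RDF/XML");

            // Both documents yield the single subject <http://example/doc#abc>.
            System.out.println(m1.isIsomorphicWith(m2));   // true
        }
    }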


Do you have a minimal example of what you are trying to achieve and 
what you currently get?


Is this related to:

https://github.com/apache/jena/issues/2007

> Using apache-jena 3.17.0

released 2020-11-25

For security reasons (including with RDF/XML), you should upgrade to 
Jena 4.9.0


Andy



Thanks




Re: Jena hangs on deleted files

2023-09-07 Thread Andy Seaborne

Are the database files on a MS Windows filesystem?

There is a long-standing Java issue that memory mapped files on MS 
Windows do not get freed until the JVM exits.


Various bugs in the OpenJDK bug database such as:

https://bugs.openjdk.org/browse/JDK-4715154

Andy

On 07/09/2023 13:06, Mikael Pesonen wrote:


We used deleteOld param. The 50 gigs are ghost files that are deleted 
but not released, that's what I meant by hanging on deleted files. 
Restarting jena releases them and now for example freed 50 gigs of space.


On 07/09/2023 15.02, Øyvind Gjesdal wrote:

What does the content of the tdb2 folder look like?

I think compact by default never deletes the old data, but you have
parameters for making it delete the old content on completion.

`--deleteOld` can be supplied to the tdb2.tdbcompact command line tool 
and

`?deleteOld=true` can be supplied to the administration api when calling
compact
https://jena.apache.org/documentation/fuseki2/fuseki-server-protocol.html#compact

You can also delete  the Data- that isn't the latest one in the
database folder.

Best regards,
Øyvind

On Thu, Sep 7, 2023 at 1:33 PM Mikael Pesonen 


wrote:


After a while 25 gigs of files on data folder becomes 80 gigs of disk
usage because Jena (4.6.1) doesn't release files. Same with compact. Is
this fixed in newer versions?





Re: Problem with federated queries

2023-08-31 Thread Andy Seaborne




On 31/08/2023 08:58, Simon Bin wrote:

On Wed, 2023-08-30 at 21:36 +0100, Andy Seaborne wrote:

The query editor in the UI is a 3rd party component (from @zazuko/yasqe -
it has security bug fixes from the original). It has a SPARQL 1.1
grammar engine which determines the syntax checking. It would benefit
from a contribution to update the parser. LATERAL is not implemented by
several engines.


nb I made a most trivial PR for this on
https://github.com/TriplyDB/Yasgui/pull/217 so maybe Jena could vendor
it for Fuseki (obviously it doesn't make sense for a strict sparql 1.1
query editor).


I'm not sure how active TriplyDB/Yasgui is - the last commit was a year 
ago. That would be OK, but there have been some security issues raised against 
that code and they are fixed in


  https://github.com/zazuko/Yasgui

Andy



Re: Encountering the error "org.apache.thrift.protocol.TProtocolException: Unrecognized type 0" in different scenarios

2023-08-30 Thread Andy Seaborne

Hi Jan,

On 30/08/2023 14:58, Jan Eerdekens wrote:

Hi,

We've been evaluating and using Jena for about 1.5 years now, but are
recently running into a perplexing issue. In a lot of different scenarios,
ways of using Jena, we are getting the exceptions like the one below:




Caused by: org.apache.thrift.protocol.TProtocolException: Unrecognized type
0
at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:140)
~[fuseki-server.jar:4.8.0]



The different scenarios where it has happened are:

   - LOADing data into a dataset
   - compacting a dataset
   - querying a dataset

In all those case we've run into trouble and get an exception that
mentions *org.apache.jena.tdb2.TDBException:
NodeTableTRDF/Read* and *org.apache.thrift.protocol.TProtocolException:
Unrecognized type 0*.

What can cause this? This looks kinda similar to this mailing list
question, https://www.mail-archive.com/users@jena.apache.org/msg20409.html,
where it seems data corruption is mentioned that potentially isn't
recoverable?

>

The first time I encountered this issue was while doing a bunch of
sequential LOAD commands to prepare a large dataset for load testing. I
used files of around 50mb (started off with bigger ones) and after about 20
to 25 LOADs it would get this error (also the completion time of a LOAD
would go up and up). So for this scenario I was running locally (Jena
Fuseki running in docker/Rancher) and only running the LOADs and not much
else except for a SELECT here and there (via the Fuseki UI) to check that
performance while LOADing. Is there a way that that could cause data
corruption and the exception we're seeing?


"Unrecognized type 0" has come up in a couple of cases.

It means the node table is corrupt but the problem was caused silently 
at some point in the past. The "Unrecognized type 0" exception happens 
some time later (not a few seconds - either after a restart or a long 
time of usage that has churned the node cache - possibly many months).


There have been some fixes around compaction that addressed bugs in this 
area. This has been the most common problem.


Was this database originally created before 4.8.0?

If not, do you have a fixed scenario so that the situation can be 
recreated for 4.9.0? Please raise a github issue for it.


Another situation is if another OS process interferes with the files 
(container OS or host OS). What operating system is the host machine?


While TDB2 endeavours to protect against multiple copies of TDB running 
the same files, that is imperfect if it is two containers and the 
database is on a mounted docker volume used by two containers.


One other report seemed to be a backup process was running over the 
files. We didn't get to the root cause of that one.


Andy



regards,

Jan Eerdekens



Re: Java 11 vs Java 17

2023-08-30 Thread Andy Seaborne




On 29/08/2023 12:26, Andy Seaborne wrote:


Which result format is this? JSON? XML?


Thanks - the fact that the impact is on the JSON and XML results writers 
suggests that the difference is in that area.


Andy






No suggestion that our case is representative of any broader pattern.

Dave


     Andy


Re: Problem with federated queries

2023-08-30 Thread Andy Seaborne




On 30/08/2023 08:47, fano.rampar...@orange.com wrote:

Thank you Simon and Andy. The LATERAL clause solved my problem. Hopefully, it 
is implemented in fuseki (although the UI displays a warning telling that it 
doesn't know this token). In case I need more control over federated query, I 
will look deeper into the service_enhancer extension.
Thomas, this suggests that Wikidata implements correctly the VALUES clause.


The query editor in the UI is a 3rd party component (from @zazuko/yasqe - 
it has security bug fixes from the original). It has a SPARQL 1.1 
grammar engine which determines the syntax checking. It would benefit 
from a contribution to update the parser. LATERAL is not implemented by 
several engines.


Andy




-Original Message-
From: Andy Seaborne 
Sent: Tuesday 29 August 2023 22:10
To: users@jena.apache.org
Subject: Re: Problem with federated queries

There is also the service enhancer

https://jena.apache.org/documentation/query/service_enhancer.html

which provides various ways to control federated query.

  Andy

On 29/08/2023 19:22, Simon Bin wrote:

You could use the "LATERAL" extension of Jena (not standard Sparql
1.1):

PREFIX wd: <http://www.wikidata.org/entity/> PREFIX owl:
<http://www.w3.org/2002/07/owl#> PREFIX ex: <http://example/> PREFIX
wdt:  <http://www.wikidata.org/prop/direct/>
PREFIX geof: <http://www.opengis.net/def/geosparql/function/>
SELECT *
WHERE {
{
  ?ParisWDID owl:sameAs ex:Paris .
  ?BordeauxWDID owl:sameAs ex:Bordeaux .
} LATERAL {
  SERVICE <https://query.wikidata.org/sparql> {
SELECT * {
  ?ParisWDID wdt:P625 ?ParisLoc .
  ?BordeauxWDID wdt:P625 ?BordeauxLoc .
  BIND(geof:distance(?ParisLoc,?BordeauxLoc) AS ?dist)
} LIMIT 1
  }
}
}

Cheers,

On Tue, 2023-08-29 at 16:28 +, fano.rampar...@orange.com wrote:




-Original Message-
From: Thomas Francart  ...
No, that's not true. This is a possible implementation for federated
querying (*"Implementers of SPARQL 1.1 Federated Query may use the
VALUES clause...")*, but this is transparent for you, you don't have
to use the VALUES clause yourself.



Unfortunately, there is no example in that document on how to use
it,



That's because you don't have to use it

  FR> you're right, I missed the "Implementers of..."

Don't do federated querying :-)
Try to use an http to debug the exact query that is being sent by
Jena to Wikidata, this will help you understand the problem. Or maybe
Jena has a parameter itself to debug the queries it sends to external
services ?

  FR> If nobody in the list provides a solution, I will run
two instances of fuseki. One "local" and one "remote" and I'll check
on the standard output of the "remote" if the "local" has issued a
query with the clause "VALUES" .



-Original Message-
From: RAMPARANY Fano INNOV/IT-S
Sent: Tuesday 29 August 2023 10:58
To: users@jena.apache.org
Subject: RE: Problem with federated queries

Thank you for pointing out the reason for the issue. However,
introducing the subquery first doesn't seem to work either.

I slightly modified the query you suggested to:

PREFIX wd: <http://www.wikidata.org/entity/> PREFIX owl: <
http://www.w3.org/2002/07/owl#> PREFIX ex: <http://example/> PREFIX
wdt:
<http://www.wikidata.org/prop/direct/>
PREFIX geof: <http://www.opengis.net/def/geosparql/function/>
SELECT *
WHERE {
    {
      SELECT ?ParisWDID ?BordeauxWDID
      WHERE {
    BIND (wd:Q90 AS ?ParisWDID)
    BIND (wd:Q1479 AS ?BordeauxWDID)
      }
    }
    SERVICE <https://query.wikidata.org/sparql> {
     ?ParisWDID wdt:P625 ?ParisLoc .
     ?BordeauxWDID wdt:P625 ?BordeauxLoc .
     BIND(geof:distance(?ParisLoc,?BordeauxLoc) AS ?dist)
    }
}

Because the variables ?ParisWDID and ?BordeauxWDID should hold the
Wikidata identifiers. But the target query should be:

PREFIX wd: <http://www.wikidata.org/entity/> PREFIX owl: <
http://www.w3.org/2002/07/owl#> PREFIX ex: <http://example/> PREFIX
wdt:
<http://www.wikidata.org/prop/direct/>
PREFIX geof: <http://www.opengis.net/def/geosparql/function/>
SELECT *
WHERE {
    {
      SELECT ?ParisWDID ?BordeauxWDID
      WHERE {
    ?ParisWDID owl:sameAs ex:Paris .
    ?BordeauxWDID owl:sameAs ex:Bordeaux .
      }
    }
    SERVICE <https://query.wikidata.org/sparql> {
     ?ParisWDID wdt:P625 ?ParisLoc .
     ?BordeauxWDID wdt:P625 ?BordeauxLoc .
     BIND(geof:distance(?ParisLoc,?BordeauxLoc) AS ?dist)
    }
}

As these identifiers are defined in the RDF graph and are not
supposed to be known when building the query.

Unfortunately, although they use subqueries, none of the two queries
work.
The error persists ☹

Fano

Re: Problem with federated queries

2023-08-29 Thread Andy Seaborne

There is also the service enhancer

https://jena.apache.org/documentation/query/service_enhancer.html

which provides various ways to control federated query.

Andy

On 29/08/2023 19:22, Simon Bin wrote:

You could use the "LATERAL" extension of Jena (not standard Sparql
1.1):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX ex: <http://example/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX geof: <http://www.opengis.net/def/geosparql/function/>
SELECT *
WHERE {
   {
 ?ParisWDID owl:sameAs ex:Paris .
 ?BordeauxWDID owl:sameAs ex:Bordeaux .
   } LATERAL {
 SERVICE <https://query.wikidata.org/sparql> {
   SELECT * {
 ?ParisWDID wdt:P625 ?ParisLoc .
 ?BordeauxWDID wdt:P625 ?BordeauxLoc .
 BIND(geof:distance(?ParisLoc,?BordeauxLoc) AS ?dist)
   } LIMIT 1
 }
   }
}

Cheers,

On Tue, 2023-08-29 at 16:28 +, fano.rampar...@orange.com wrote:




-Original Message-
From: Thomas Francart 
...
No, that's not true. This is a possible implementation for federated
querying (*"Implementers of SPARQL 1.1 Federated Query may use the
VALUES clause...")*, but this is transparent for you, you don't have
to use the VALUES clause yourself.



Unfortunately, there is no example in that document on how to use
it,



That's because you don't have to use it

 FR> you're right, I missed the "Implementers of..."

Don't do federated querying :-)
Try to use an http to debug the exact query that is being sent by
Jena to Wikidata, this will help you understand the problem. Or maybe
Jena has a parameter itself to debug the queries it sends to external
services ?

 FR> If nobody in the list provides a solution, I will run
two instances of fuseki. One "local" and one "remote" and I'll check
on the standard output of the "remote" if the "local" has issued a
query with the clause "VALUES" .



-Original Message-
From: RAMPARANY Fano INNOV/IT-S
Sent: Tuesday 29 August 2023 10:58
To: users@jena.apache.org
Subject: RE: Problem with federated queries

Thank you for pointing out the reason for the issue. However,
introducing the subquery first doesn't seem to work either.

I slightly modified the query you suggested to:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX ex: <http://example/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX geof: <http://www.opengis.net/def/geosparql/function/>
SELECT *
WHERE {
   {
     SELECT ?ParisWDID ?BordeauxWDID
     WHERE {
   BIND (wd:Q90 AS ?ParisWDID)
   BIND (wd:Q1479 AS ?BordeauxWDID)
     }
   }
   SERVICE <https://query.wikidata.org/sparql> {
    ?ParisWDID wdt:P625 ?ParisLoc .
    ?BordeauxWDID wdt:P625 ?BordeauxLoc .
    BIND(geof:distance(?ParisLoc,?BordeauxLoc) AS ?dist)
   }
}

Because the variables ?ParisWDID and ?BordeauxWDID should hold the
Wikidata identifiers. But the target query should be:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX ex: <http://example/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX geof: <http://www.opengis.net/def/geosparql/function/>
SELECT *
WHERE {
   {
     SELECT ?ParisWDID ?BordeauxWDID
     WHERE {
   ?ParisWDID owl:sameAs ex:Paris .
   ?BordeauxWDID owl:sameAs ex:Bordeaux .
     }
   }
   SERVICE <https://query.wikidata.org/sparql> {
    ?ParisWDID wdt:P625 ?ParisLoc .
    ?BordeauxWDID wdt:P625 ?BordeauxLoc .
    BIND(geof:distance(?ParisLoc,?BordeauxLoc) AS ?dist)
   }
}

As these identifiers are defined in the RDF graph and are not
supposed
to be known when building the query.

Unfortunately, although they use subqueries, none of the two
queries work.
The error persists ☹

Fano



-Original Message-
From: Thomas Francart 
Sent: Monday 28 August 2023 18:26
To: users@jena.apache.org
Subject: Re: Problem with federated queries

One typical problem is that the federated query might be executed
*before* the rest of the query.
So when you write

   SERVICE <https://query.wikidata.org/sparql> {
    ?ParisWDID wdt:P625 ?ParisLoc .
    ?BordeauxWDID wdt:P625 ?BordeauxLoc .
    BIND(geof:distance(?ParisLoc,?BordeauxLoc) AS ?dist)
   }

Then that part is sent to Wikidata *without any bindings of the
variables*, which is basically asking wikidata to return the
distance
between *all* pairs of entities in the database, resulting in a
timeout.
And this is why your second query works.

If you want to guarantee ordering of execution, use a subquery,
which
is logically executed first :

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX ex: <http://example/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX geof: <http://www.opengis.net/def/geosparql/function/>
SELECT *
WHERE {

{
   SELECT ?ParisWDID ?BordeauxWDID
   WHERE {
     BIND(ex:Paris AS ?ParisWDID )
     

Re: Java 11 vs Java 17

2023-08-29 Thread Andy Seaborne




On 29/08/2023 08:46, Dave Reynolds wrote:

Hi Andy,

On 27/08/2023 10:36, Andy Seaborne wrote:


On 25/08/2023 15:18, Dave Reynolds wrote: [1]
 > We've been testing some of our troublesome queries on 4.9.0 on java
 > 11 vs java 17 and see a 10-15% performance hit on java 17 (even after
 > we take control of the GC by forcing both to use the old parallel GC
 > instead of G1). No idea why, seems wrong! Makes us inclined to stick
 > with java 11 and thus jena 4.x series as long as we can.

Dave,

Is this 4.9.0 specific or across multiple Jena versions?


Seems to be multiple versions (at least 4.8.0 and 4.9.0), but not tested 
exhaustively.



Is G1 worse than the old parallel GC on Java17?


It is definitely worse on Java11 for a particular narrow type of query 
that is an issue for us. Believe the same is true on Java17 but haven't 
collected definitive data on this.


It may be possible to tune G1 to better match our particular test case 
but the testing and tuning is time consuming and the parallel GC does 
the trick.


Our aim was to replace a system running on 3.x era fuseki with a 4.x era 
one without significant loss of performance. Out of box there was a 20% 
hit. Switching GC reduced much of that, switching to java11 instead of 
17 brought us basically to parity - for this special case. This is a 
case where legitimate queries get close to the timeout threshold we run 
at, so a 20% performance drop is particularly visible in having 
currently working queries timeout on a newer version.


The query itself is trivial - return large numbers of resources (10k-1m) 
found by a simple lucene query along with a few (~15) properties of 
each. Performance in this case seems to be dominated by the time to 
render the large results stream rather than lucene or TDB query 
performance. So it makes some sense that in this specific case a GC 
tuned for throughput rather than pause time would help.


Which result format is this? JSON? XML?



No suggestion that our case is representative of any broader pattern.

Dave


Andy


Java 11 vs Java 17

2023-08-27 Thread Andy Seaborne



On 25/08/2023 15:18, Dave Reynolds wrote: [1]
> We've been testing some of our troublesome queries on 4.9.0 on java
> 11 vs java 17 and see a 10-15% performance hit on java 17 (even after
> we take control of the GC by forcing both to use the old parallel GC
> instead of G1). No idea why, seems wrong! Makes us inclined to stick
> with java 11 and thus jena 4.x series as long as we can.

Dave,

Is this 4.9.0 specific or across multiple Jena versions?
Is G1 worse than the old parallel GC on Java17?

Andy

[1]
https://lists.apache.org/thread/74b2xn46hjxw7b6gkw5g948kgffltvj5


Re: TDB2 Exception in initialisation

2023-08-25 Thread Andy Seaborne




On 25/08/2023 09:04, Enrico.Daga wrote:

Hi,

I am having a strange error while attempting to initialise a TDB2 dataset:

TDB2Factory.connectDataset(tdb2location);

tdb2location exists. However, I am getting the error below, which I don't 
understand. It seems related to some file system issue, but I can initialise 
and connect to TDB2 instances fine in other parts of the app, on the same host 
(my laptop, btw...).

Any clue?


Hi - which version is this? SystemTDB.java:379 does not align with the 
current codebase (nor do the other stacktrace lines).


The only NPE possibility I see is when it checks the system context, 
i.e. if initialization has a problem.


1/ Have you repacked the jars in any way?

2/ Before the first call by any code into Jena, try calling 
JenaSystem.init. This forces it to happen in a more predictable way.

The automatic way is at risk from class loader ordering.

3/ Before any jena code, set the init logging
 JenaSystem.DEBUG_INIT = true

You should see on stderr

  JenaInitLevel0   [0]
  InitJenaCore [10]
  InitRIOT [20]
  InitARQ  [30]
  InitTDB2 [42]
  InitPatch[60]
  InitRDFS [60]
  InitShacl[95]
  InitShex [96]

then a record of everything called.
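
For reference, a minimal sketch of doing 2/ and 3/ from application code,
before anything else touches Jena (the database location string is illustrative):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.sys.JenaSystem;
    import org.apache.jena.tdb2.TDB2Factory;

    public class AppMain {
        public static void main(String[] args) {
            // Set before any other Jena class is touched so the init trace goes to stderr
            JenaSystem.DEBUG_INIT = true;
            // Force initialization explicitly instead of relying on class loader ordering
            JenaSystem.init();
            String tdb2location = "/path/to/tdb2location";   // the application's existing location
            Dataset dataset = TDB2Factory.connectDataset(tdb2location);
            // ... use the dataset ...
        }
    }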

Andy



Thx!

Enrico

---

java.lang.ExceptionInInitializerError: null
  at 
org.apache.jena.tdb2.params.StoreParams.getDftStoreParams(StoreParams.java:121)
  at 
org.apache.jena.tdb2.store.TDB2StorageBuilder.build(TDB2StorageBuilder.java:91)
  at org.apache.jena.tdb2.sys.StoreConnection.make(StoreConnection.java:93)
  at 
org.apache.jena.tdb2.sys.StoreConnection.connectCreate(StoreConnection.java:61)
  at 
org.apache.jena.tdb2.sys.DatabaseOps.createSwitchable(DatabaseOps.java:96)
  at org.apache.jena.tdb2.sys.DatabaseOps.create(DatabaseOps.java:77)
  at 
org.apache.jena.tdb2.sys.DatabaseConnection.build(DatabaseConnection.java:103)
  at 
org.apache.jena.tdb2.sys.DatabaseConnection.lambda$make$0(DatabaseConnection.java:74)
  at 
java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
 ~[?:1.8.0_292]
  at 
org.apache.jena.tdb2.sys.DatabaseConnection.make(DatabaseConnection.java:74)
  at 
org.apache.jena.tdb2.sys.DatabaseConnection.connectCreate(DatabaseConnection.java:63)
  at 
org.apache.jena.tdb2.sys.DatabaseConnection.connectCreate(DatabaseConnection.java:54)
  at org.apache.jena.tdb2.DatabaseMgr.DB_ConnectCreate(DatabaseMgr.java:41)
  at 
org.apache.jena.tdb2.DatabaseMgr.connectDatasetGraph(DatabaseMgr.java:46)
  at org.apache.jena.tdb2.TDB2Factory.connectDataset(TDB2Factory.java:52)
  at org.apache.jena.tdb2.TDB2Factory.connectDataset(TDB2Factory.java:70)
[...]
Caused by: java.lang.NullPointerException
  at 
org.apache.jena.tdb2.sys.SystemTDB.determineFileMode(SystemTDB.java:379)
  at org.apache.jena.tdb2.sys.SystemTDB.fileMode(SystemTDB.java:357)
  at 
org.apache.jena.tdb2.params.StoreParamsConst.(StoreParamsConst.java:33)


--
Enrico Daga, PhD

www.enridaga.net | @enridaga

SPARQL Anything http://sparql-anything.cc
Polifonia http://polifonia-project.eu
SPICE http://spice-h2020.eu
Open Knowledge Graph http://data.open.ac.uk

Senior Research Fellow, Knowledge Media Institute, STEM Faculty
The Open University
Level 4 Berrill Building, Walton Hall, Milton Keynes, MK7 6AA
Direct: +44 (0) 1908 654887


Re: Transactions over http (fuseki)

2023-08-25 Thread Andy Seaborne




On 18/08/2023 07:38, Gaspar Bartalus wrote:

Hi,

We would like to execute queries (construct) and updates (insert/delete
data) in one transaction.
Very similar I think to what Andy described here:
https://github.com/w3c/sparql-dev/issues/83


Something that has not quite ever managed to bubble up my todo list!

By exposing transaction control, there will have to be some defensive 
features like timeouts because otherwise an errant client could disrupt 
a server.
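
For comparison, when the code runs in the same JVM as the dataset this mixed
query-plus-update case is already covered by the dataset transaction API; a
minimal sketch using the Txn helper (database location and data are illustrative):

    import org.apache.jena.query.*;
    import org.apache.jena.system.Txn;
    import org.apache.jena.tdb2.TDB2Factory;
    import org.apache.jena.update.UpdateAction;

    Dataset dataset = TDB2Factory.connectDataset("/path/to/db");
    Txn.executeWrite(dataset, () -> {
        // The update and the query run inside the same write transaction
        UpdateAction.parseExecute("INSERT DATA { <urn:ex:s> <urn:ex:p> <urn:ex:o> }", dataset);
        try (QueryExecution qExec = QueryExecutionFactory.create(
                "SELECT (COUNT(*) AS ?n) { ?s ?p ?o }", dataset)) {
            ResultSetFormatter.out(qExec.execSelect());
        }
    });

Exposing the same thing across several HTTP requests is the open part.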


Andy



Gaspar

On Thu, Aug 17, 2023 at 7:54 PM Marco Neumann 
wrote:


Caspar, while you are looking into the extensions how would you like to use
transactions in your use case?

Marco

On Thu, 17 Aug 2023 at 13:44, Gaspar Bartalus  wrote:


Hi Lorenz,

Thanks for the quick response. That sounds indeed very promising.
Looking forward to knowing more details about the fuseki extension
mechanism, or a transaction module in particular.

Gaspar

On Thu, Aug 17, 2023 at 9:17 AM Lorenz Buehmann <
buehm...@informatik.uni-leipzig.de> wrote:


Hi,

that is an open issue in the SPARQL standard and Andy already opened a
ticket [1] regarding this maybe w.r.t. an upcoming SPARQL 1.2

I think mixed query types are still not possible via standard Fuseki in
a single transaction, but indeed an extension like you're planning
should be possible. Andy is already working on a newer Fuseki extension
mechanism (it's basically already there) where you can plugin so-called
Fuseki modules. This would be the way I'd try to add this extension to
Fuseki.

Indeed, Andy knows better and can give you more specific code or
pointers - maybe he even has such a module or code part implemented
somewhere.


Regards,

Lorenz


[1] https://github.com/w3c/sparql-dev/issues/83

On 16.08.23 17:20, Gaspar Bartalus wrote:

Hi,

We’ve been using jena-fuseki to store and interact with RDF data by

running

queries over the http endpoints.
We are now facing the challenge to use transactional operations on

the

triple store, i.e. running multiple sparql queries (both select and

update

queries) in a single transaction.
I would like to ask what your suggestion might be to achieve this.

The idea we have in mind is to extend jena-fuseki with new http

endpoints

for handling transactions.
Would this be technically feasible, i.e. could we reach the internal
transaction API (store API?) from jena-fuseki?
Would you agree with this approach conceptually, or would you

recommend

something different?

Thanks in advance,
Gaspar

PS: Sorry for the duplicate, I have the feeling that my other email

address

is blocked somehow.


--
Lorenz Bühmann
Research Associate/Scientific Developer

Email buehm...@infai.org

Institute for Applied Informatics e.V. (InfAI) | Goerdelerring 9 |

04109

Leipzig | Germany





--


---
Marco Neumann





Re: RSIterator deprecation

2023-08-25 Thread Andy Seaborne




On 21/08/2023 19:19, Andrii Berezovskyi wrote:

Hello Andy,

Thank you for the pointer to ReifierStd, it seems to do the job as expected 
[1]. I hope we are using the new API correctly (I would be very thankful for 
any feedback).


ReifierStd is the code behind the Model reification - yes, that looks right.



While working on the changes, I re-discovered that Model.getResource() only accepts 
Strings and simply passing "node.getURI()" fails on blank nodes. I found that 
[2] from SPDX tools implements the logic I expect. Would such method make sense on a Jena 
Model (as in, to be added to the org.apache.jena.rdf.model.Model API)?



That would be to add "Resource getResource(AnonId)", c.f. 
model.createResource(AnonId)?


While Model.getResource does describe itself as legacy, it does seem 
like a more natural name than createResource


So yes, that does seem reasonable and not disruptive.
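
A sketch of the kind of companion helper meant here, mapping a graph-level Node
back to a Model-level Resource for both URI and blank nodes (the helper name
asResource is illustrative, not an existing API):

    import org.apache.jena.graph.Node;
    import org.apache.jena.rdf.model.*;

    static Resource asResource(Model model, Node node) {
        if (node.isBlank()) {
            // No String overload applies here; go via AnonId
            return model.createResource(AnonId.create(node.getBlankNodeLabel()));
        }
        return model.createResource(node.getURI());
    }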

Andy



And I would like to thank you for your help and continued work on the Jena 
project!

–Andrew.

[1]: 
https://github.com/eclipse/lyo/pull/389/files?diff=split=1#diff-0fd9d8b08e29abbd461bbfb2cf63649a4c027397fb7f4161e8932a282ad42afcL952-R1014
[2]: 
https://github.com/spdx/tools/blob/bc35e25dacb728bf8332b82b509bb3efacd6c64e/src/org/spdx/rdfparser/SPDXDocument.java#L1225-L1235

On 16 Aug 2023, at 18:18, Andrii Berezovskyi  wrote:

Thanks Andy,

On 15 Aug 2023, at 10:12, Andy Seaborne  wrote:

That's quite a long method!

Yes, not proudest part of Lyo. To my shame, I was reluctant to refactor it 
since I took over the project. Looks like you are providing us with a necessary 
push.

What is reification used for in Lyo?
Do quoted triples provide the same capability?

Lyo provides support for implementing OASIS OSLC standards. To be brief, OSLC 
(Core) is both a predecessor and a superset to the W3C LDP. OSLC allows 
metadata to be placed on link triples using RDF 1.1 reification [1].

Assuming that by quoted triples you refer to RDF-star, yes, they provide the 
same capability and would fully cover OSLC needs. However, Lyo still needs to 
support what is standardized in OSLC. Additionally, I was under the impression that 
RDF-star hasn't reached W3C Recommendation status and OASIS discourages citing 
documents that haven't reached the standard/recommendation status in its 
standards-track specs. Regardless, OSLC standards try to maintain backwards 
compat, which means the committee will only vote to add RDF-star support to the 
standard (e.g., via a SHOULD clause) but not remove RDF 1.1 reification for 
backwards compatibility reasons.

Reification support is all calculation library code - it's a way to
present reification. It does not affect storage.

Jena2 had variations of reification which did impact storage - these
have not existed in releases for a long time.

The library that backs the reification support is ReifierStd and it will
be available in jena5. ReifierStd works on graphs, not models. Writing
companion code to provide the functionality for the Model API
equivalents would be possible. Contributions welcome.

RSIterator isn't necessary. It's a "typed next()" iterator that came
about in the pre-generics times.

I am not quite sure I understand the impact of the change. Do we need to 
implement any support if we need RDF 1.1 reification and are ready to switch to 
Graph and ReifierStd APIs, and drop the use of RSIterator?


To be clear : this is not RDF-star quoted triples.

To be clear from our side too, we are certainly interested in adopting RDF-star 
quoted triples in the future. We are not ready to drop RDF 1.1 reification 
until/unless RDF 1.2 drops it (when it reaches W3C Rec).


Also, do I understand correctly that the removal is planned for Jena 5.x after 
JDK 21 release?

Yes, sometime after Java21. Jena5 will require Java17, in line with
supporting two LTS versions of Java.

Jena is already built and tested on Java17 as part of our CI.  Users can
switch to java17 in deployments now. It is a bit faster and has Java
improvements and fixes not backported to Java11.

We've been testing Lyo on JDK 11, 17, 20, and 21-ea. We are ready for the JDK 
17 migration, it's just that some of the Lyo users use it in legacy 
environments, but we've communicated to them that Lyo will drop JDK 11 support 
when Jena does.

–Andrew.

[1]: https://oslc-op.github.io/oslc-specs/notes/link-guidance.html#Anchor



Re: Mystery memory leak in fuseki

2023-08-25 Thread Andy Seaborne




On 03/07/2023 14:20, Dave Reynolds wrote:
We have a very strange problem with recent fuseki versions when running 
(in docker containers) on small machines. Suspect a jetty issue but it's 
not clear.


From the threads here, it does seem to be Jetty related.

I haven't managed to reproduce the situation on my machine in any sort 
of predictable way where I can look at what's going on.



For Jena5, there will be a switch to a Jetty version that uses jakarta.* 
packages. That's no more than a rename of imports. The migration 
EE8->EE9 is only repackaging. That's Jetty10->Jetty11.


There is now Jetty12. It is a major re-architecture of Jetty including 
its network handling for better HTTP/2 and HTTP/3.


If there has been some behaviour of Jetty involved in the memory growth, 
it is quite unlikely to be carried over to Jetty12.


Jetty12 is not a simple switch of artifacts for Fuseki. APIs have 
changed but it's a step that is going to be needed sometime.


If it does not turn out that Fuseki needs a major re-architecture, I 
think that Jena5 should be based on Jetty12. So far, it looks doable.


Andy


Re: How to log the client ip address?

2023-08-15 Thread Andy Seaborne
If you turn on the NCSA logging, there is a standard format logging line 
output.


In the provided log4j2.properties, it is switched off by default:

logger.fuseki-request.level  = OFF

so comment out that line and an NCSA log line is generated which includes 
X-Forwarded-For.


You can direct this to a separate file using log4j2.

For development/debugging, if you run with "-v", all the headers, for 
request and response, are printed out.


Andy

On 15/08/2023 08:21, Joachim Neubert wrote:

Hi everybody,

is there a way to log the client ip address (in my case, particularly 
the  X-Forwarded-For as delivered by the proxy) in the fuseki log file?


Cheers, Joachim



Re: RSIterator deprecation

2023-08-15 Thread Andy Seaborne




On 14/08/2023 19:46, Andrii Berezovskyi wrote:

Hello,

I just noticed that .listReifiedStatements(),
ReifiedStatement, and RSIterator, which we use [1],


That's quite a long method!

What is reification used for in Lyo?
Do quoted triples provide the same capability?


have been deprecated in Jena. We looked though the javadocs and on the mailing 
list and didn't find any discussion of the migration guidance. [2] still 
mentions RSIterator and ReifiedStatement.


The website is still Jena4.

Could you please advise us on the best way to migrate from using these deprecated APIs? 


Reification support is all calculation library code - it's a way to 
present reification. It does not affect storage.


Jena2 had variations of reification which did impact storage - these 
have not existed in releases for a long time.


The library that backs the reification support is ReifierStd and it will 
be available in jena5. ReifierStd works on graphs, not models. Writing 
companion code to provide the functionality for the Model API 
equivalents would be possible. Contributions welcome.
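
For illustration, at the Graph level RDF 1.1 reification is just the four
standard vocabulary triples, which is the shape the reification calculation
works in terms of; a sketch using plain Graph operations (example IRIs are made up):

    import org.apache.jena.graph.*;
    import org.apache.jena.sparql.graph.GraphFactory;
    import org.apache.jena.vocabulary.RDF;

    Graph graph = GraphFactory.createDefaultGraph();
    Node s = NodeFactory.createURI("http://example/s");
    Node p = NodeFactory.createURI("http://example/p");
    Node o = NodeFactory.createURI("http://example/o");
    Node reif = NodeFactory.createBlankNode();   // node standing for the statement

    // The four triples RDF 1.1 reification asserts about <s p o>
    graph.add(Triple.create(reif, RDF.Nodes.type, RDF.Nodes.Statement));
    graph.add(Triple.create(reif, RDF.Nodes.subject, s));
    graph.add(Triple.create(reif, RDF.Nodes.predicate, p));
    graph.add(Triple.create(reif, RDF.Nodes.object, o));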


RSIterator isn't necessary. It's a "typed next()" iterator that came 
about in the pre-generics times.


To be clear : this is not RDF-star quoted triples.


Also, do I understand correctly that the removal is planned for Jena 5.x after 
JDK 21 release?


Yes, sometime after Java21. Jena5 will require Java17, in line with 
supporting two LTS versions of Java.


Jena is already built and tested on Java17 as part of our CI.  Users can 
switch to java17 in deployments now. It is a bit faster and has Java 
improvements and fixes not backported to Java11.


Andy

https://lists.apache.org/thread/mk7qj43lwt17cnn6k1zxz7y0dom08gqs



Thanks in advance,
Andrew


[1]: 
https://github.com/eclipse/lyo/blob/a75b1945353cea5550cbd524c0e19da2ac4d4341/core/oslc4j-core/src/main/java/org/eclipse/lyo/oslc4j/provider/jena/JenaModelHelper.java#L952
[2]: https://jena.apache.org/documentation/notes/reification.html


JSON-LD framing

2023-08-15 Thread Andy Seaborne

Jena uses jsonld-java for JSON-LD 1.0 and titanium-json-ld for JSON-LD 1.1

Rob describes the framing support for passing information to jsonld-java 
for JSON-LD 1.0.


At Jena5, support for JSON-LD 1.0 is planned for removal. The 
jsonld-java project does not appear to be active.


The project is looking for a contribution to provide passing framing 
information to titanium-json-ld.


Andy

https://github.com/jsonld-java/jsonld-java
https://github.com/filip26/titanium-json-ld


On 14/08/2023 15:04, Rob @ DNR wrote:

Riot, and more generally Jena’s, configuration symbols are actually URIs 
internally, so the --set option needs to receive the full URI for the symbol, 
which I think should start with http://jena.apache.org/riot/jsonld#, not just 
the Java constant names as they appear in the examples/API.

Also, I don’t believe that any of these context options expect to receive a 
file, rather they expect to contain a chunk of JSON itself so from the command 
line you’d probably need something like the following:

$ export FRAME=$(cat frame.json)
$ riot --out JSONLD_FRAME_PRETTY --set 
“http://jena.apache.org/riot/jsonld#JSONLD_FRAME=$FRAME” input.ttl

NB – Completely untested, I don’t use JSON-LD myself at all so no guarantees 
any of this will work, but hopefully this at least points you in the right 
direction to make progress

Rob

From: Martin 
Date: Monday, 14 August 2023 at 12:45
To: users@jena.apache.org 
Subject: riot cmd convert RDF to JSON-LD framing
Hi,

I would like to convert RDF (on Turtle format) to JSON-LD and apply a
JSON-LD framing specification to it (*) -- and I would prefer to do
this with the command line tooling that ships with Jena.

I can transform my RDF to JSON-LD with the command

   $ riot --out=jsonld [file]

but I have not found a way to pass my context json file to the command.
Attempts like this fails or does not pick up the context file:

  $ riot --out=JSONLD_FRAME_PRETTY --set JSONLD_CONTEXT=[file] [file]

These attempts are motivated by
https://jena.apache.org/documentation/io/rdf-output.html#json-ld


Is there a way to pass a context file to riot, or otherwise achieve
what I want using Jena's command line tools? If not, what is my best
other option?

Thanks!

Martin

(*) Apologies if I am not using correct terminology for this.



Re: Help with incremental loading of TTL files

2023-07-26 Thread Andy Seaborne

Hi Robert,

On 26/07/2023 12:44, Robert Alexander wrote:

Dear friends,
I am using Apache Jena Fuseki from a Docker image 
https://hub.docker.com/r/secoresearch/fuseki and all works so well. Love 
Jena/Fuseki!

The problem I’m grappling with though is that after an initial mass loading 
from RDF serialised to Turtle TTL files, my jobs produce some more TTL every 
day and I need to ADD the new ones to the graph.

I thought I could do the following:

docker exec bc89541add49 ./bin/s-put http://localhost:3030/mema_v5 'default' 
/fuseki-base/databases/mema_ttls/all/20230725.ttl


These are HTTP verbs. "put" means replace whatever data is at the URL.

"post" is the operation to add to the data.

Andy



But discovered by trial and error that sadly this REPLACES the original large 
graph with just the last file content.

Please suggest a solution to this rather inept medical doctor :) You all be 
happy and healthy. Cheers from Rome, Italy

Robert Alexander



Required Java version

2023-07-20 Thread Andy Seaborne



A reminder that Jena will move from requiring Java11 to requiring Java17.

The project aims to support for 2 LTS versions of Java.
Java21 is scheduled for September 19 this year and is LTS.

Some time after that date that Jena will switch, with a major version 
bump to Jena 5.x.y


For Fuseki, Jena5 will use jakarta.* packages.

Jena is already tested for Java17 and Java21_latest_EA on a regular 
basis. You can switch JVMs now.


Andy


Re: CVE-2023-32200

2023-07-20 Thread Andy Seaborne




On 20/07/2023 17:18, Brandon Sara wrote:


I just came across CVE-2023-32200 and was wondering, is it different than 
CVE-2023-22665 and, if so, how is it different?



Jena 4.8.0 addresses CVE-2023-22665 by requiring the Java system 
property "jena:scripting" to enable scripting.


Jena 4.9.0 addresses CVE-2023-32200 which happens if scripting is 
enabled (4.8.0). The change goes further than only addressing the 
security issue by requiring script functions to be in an "allowed" list; 
that is, there is an API contract for callable scripts. Other functions 
in the script file are not callable which should help development.


Running Java17 means there is no scripting engine unless the deployment
has added one. Java11 has a scripting engine in the JDK.

Andy


Re: Mystery memory leak in fuseki

2023-07-19 Thread Andy Seaborne

Conal,

Thanks for the information.
Can you see if metaspace is growing as well?

All,

Could someone please try running Fuseki main, with no datasets (--empty), 
with some healthcheck ping traffic.


Andy

On 19/07/2023 14:42, Conal McLaughlin wrote:

Hey Dave,

Thank you for providing an in depth analysis of your issues.
We appear to be witnessing the same type of problems with our current 
Fuseki deployment.
We are deploying a containerised Fuseki into a AWS ECS task alongside 
other containers - this may not be ideal but that’s a different story.


I just wanted to add another data point to everything you have described.
Firstly, it does seem like “idle” (or very low traffic) instances are 
the problem, for us (coupled with a larger heap than necessary).
We witness the same increase in the ECS task memory consumption up until 
the whole thing is killed off. Which includes the Fuseki container.


In an attempt to see what was going on beneath the hood, we turned up 
the logging to TRACE in the log4j2.xml file provided to Fuseki.

This appeared to stabilise the increasing memory consumption.
Even just switching the `logger.jetty.level` to TRACE alleviates the issue.


Colour me confused!

A Log4j logger that is active will use a few objects - maybe that's enough 
to trigger a minor GC which in turn is enough to flush some non-heap 
resources.


How big is the heap?
This is Java17?

We are testing this on Fuseki 4.8.0/TDB2 with close to 0 triples and 
extremely low query traffic / health checks via /ping.

[Attached image: ecs-task-memory graph, hosted on pasteboard.co]




Cheers,
Conal

On 2023/07/11 09:31:25 Dave Reynolds wrote:
 > Hi Rob,
 >
 > Good point. Will try to find time to experiment with that but given the
 > testing cycle time that will take a while and can't start immediately.
 >
 > I'm a little sceptical though. As mentioned before, all the metrics we
 > see show the direct memory pool that Jetty uses cycling up the max heap
 > size and then being collected but with no long term growth to match the
 > process size growth. This really feels more like a bug (though not sure
 > where) than tuning. The fact that actual behaviour doesn't match the
 > documentation isn't encouraging.
 >
 > It's also pretty hard to figure what the right pool configuration would
 > be. This thing is just being asked to deliver a few metrics (12KB per
 > request) several times a minute but manages to eat 500MB of direct
 > buffer space every 5mins. So what the right pool parameters are to
 > support real usage peaks is not going to be easy to figure out.
 >
 > None the less you are right. That's something that should be explored.
 >
 > Dave
 >
 >
 > On 11/07/2023 09:45, Rob @ DNR wrote:
 > > Dave
 > >
 > > Thanks for the further information.
 > >
 > > Have you experimented with using Jetty 10 but providing more 
detailed configuration? Fuseki supports providing detailed Jetty 
configuration if needed via the --jetty-config option

 > >
 > > The following section look relevant:
 > >
 > > 
https://eclipse.dev/jetty/documentation/jetty-10/operations-guide/index.html#og-module-bytebufferpool

 > >
 > > It looks like the default is that Jetty uses a heuristic to 
determine these values, sadly the heuristic in question is not detailed 
in that documentation.

 > >
 > > Best guess from digging through their code is that the “heuristic” 
is this:

 > >
 > > 
https://github.com/eclipse/jetty.project/blob/jetty-10.0.x/jetty-io/src/main/java/org/eclipse/jetty/io/AbstractByteBufferPool.java#L78-L84

 > >
 > > i.e., ¼ of the configured max heap size. This doesn't necessarily 
align with the exact sizes of process growth you see but I note the 
documentation does explicitly say that buffers used can go beyond these 
limits but that those will just be GC’d rather than pooled for reuse.

 > >
 > > Example byte buffer configuration at 
https://github.com/eclipse/jetty.project/blob/9a05c75ad28ebad4abbe624fa432664c59763747/jetty-server/src/main/config/etc/jetty-bytebufferpool.xml#L4

 > >
 > > Any chance you could try customising this for your needs with stock 
Fuseki and see if this allows you to make the process size smaller and 
sufficiently predictable for your use case?

 > >
 > > Rob
 > >
 > > From: Dave Reynolds 
 > > Date: Tuesday, 11 July 2023 at 08:58
 > > To: users@jena.apache.org 
 > > Subject: Re: Mystery memory leak in fuseki
 > > For interest[*] ...
 > >
 > > This is what the core JVM metrics look like when transitioning from a
 > > Jetty10 to a Jetty9.4 instance. You can see the direct buffer 
cycling up
 > > to 500MB (which happens to be the max heap setting) on Jetty 10, 
nothing

 > > on Jetty 9. The drop in Mapped buffers is just because TDB hadn't been
 > > asked any queries yet.
 > >
 > > 

Re: java.lang.Error: Maximum lock count exceeded

2023-07-17 Thread Andy Seaborne

Could you describe how you are using the database?

The stacktrace is consistent with application code not ending a read 
transaction.
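
For reference, the usual patterns that guarantee a read transaction is ended
on every code path (a sketch; ds stands in for the application's Dataset):

    import org.apache.jena.query.ReadWrite;
    import org.apache.jena.system.Txn;

    // Explicit form: end() must run even when the body throws
    ds.begin(ReadWrite.READ);
    try {
        // ... run queries ...
    } finally {
        ds.end();
    }

    // Equivalent, with begin/end handled by the helper
    Txn.executeRead(ds, () -> {
        // ... run queries ...
    });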


Andy

On 16/07/2023 21:55, Jean-Marc Vanel wrote:

It is Jena 4.8.0 .

Jean-Marc Vanel
<http://semantic-forms.cc:9112/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me>
+33
(0)6 89 16 29 52


Le dim. 16 juil. 2023 à 21:53, Andy Seaborne  a écrit :


https://github.com/apache/jena/issues/1499

Are you are using 4.6.0?

  Andy

On 16/07/2023 14:54, Jean-Marc Vanel wrote:

Every few weeks, I get this stack, and the database is unusable

afterwards.

Is there some "purge" to call now and then ?

java.lang.Error: Maximum lock count exceeded
  at


java.base/java.util.concurrent.locks.ReentrantReadWriteLock$Sync.fullTryAcquireShared(ReentrantReadWriteLock.java:535)

  at


java.base/java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryAcquireShared(ReentrantReadWriteLock.java:494)

  at


java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1323)

  at


java.base/java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:738)

  at


org.apache.jena.dboe.storage.system.DatasetGraphTxnCtl.beginMultiMode(DatasetGraphTxnCtl.java:334)

  at


org.apache.jena.dboe.storage.system.DatasetGraphTxnCtl.tryNonExclusiveMode(DatasetGraphTxnCtl.java:252)

  at


org.apache.jena.dboe.storage.system.DatasetGraphTxnCtl.enterTransaction(DatasetGraphTxnCtl.java:110)

  at


org.apache.jena.dboe.storage.system.DatasetGraphTxnCtl.begin(DatasetGraphTxnCtl.java:74)

  at


org.apache.jena.dboe.storage.system.DatasetGraphTxnCtl.begin(DatasetGraphTxnCtl.java:99)

  at
org.apache.jena.sparql.core.DatasetImpl.begin(DatasetImpl.java:120)

Jean-Marc Vanel
<

http://semantic-forms.cc:1952/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me


+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr
   Chroniques jardin
<

http://semantic-forms.cc:1952/history?uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FChronicle









Re: java.lang.Error: Maximum lock count exceeded

2023-07-16 Thread Andy Seaborne

https://github.com/apache/jena/issues/1499

Are you are using 4.6.0?

Andy

On 16/07/2023 14:54, Jean-Marc Vanel wrote:

Every few weeks, I get this stack, and the database is unusable afterwards.
Is there some "purge" to call now and then ?

java.lang.Error: Maximum lock count exceeded
 at
java.base/java.util.concurrent.locks.ReentrantReadWriteLock$Sync.fullTryAcquireShared(ReentrantReadWriteLock.java:535)
 at
java.base/java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryAcquireShared(ReentrantReadWriteLock.java:494)
 at
java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1323)
 at
java.base/java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:738)
 at
org.apache.jena.dboe.storage.system.DatasetGraphTxnCtl.beginMultiMode(DatasetGraphTxnCtl.java:334)
 at
org.apache.jena.dboe.storage.system.DatasetGraphTxnCtl.tryNonExclusiveMode(DatasetGraphTxnCtl.java:252)
 at
org.apache.jena.dboe.storage.system.DatasetGraphTxnCtl.enterTransaction(DatasetGraphTxnCtl.java:110)
 at
org.apache.jena.dboe.storage.system.DatasetGraphTxnCtl.begin(DatasetGraphTxnCtl.java:74)
 at
org.apache.jena.dboe.storage.system.DatasetGraphTxnCtl.begin(DatasetGraphTxnCtl.java:99)
 at
org.apache.jena.sparql.core.DatasetImpl.begin(DatasetImpl.java:120)

Jean-Marc Vanel

+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr
  Chroniques jardin




Re: error on tdb2.tdbbackup: "Caused by: org.apache.thrift.protocol.TProtocolException: Unrecognized type 0"

2023-07-14 Thread Andy Seaborne




On 14/07/2023 01:18, Jeffrey C. Witt wrote:

Hi Andy,

Thanks for your response.

Based on this comment...

It's not looking good for the database if /$/backup is failing. That's a
very simple use of the database.

Do you think the database is corrupted in such a way that it would be
better to just do a complete rebuild?


Yes, that is safer.

Andy



Many thanks,
jw

On Thu, Jul 13, 2023 at 3:55 PM Andy Seaborne  wrote:


Hi Jeff,

There were fixes to compaction in 4.6.0.

On 12/07/2023 23:53, Jeffrey C. Witt wrote:

Dear List,

I ran into an unusual error today when I tried to backup (and also

compact)

my TDB instance.

I first encountered the error when trying to compact and backup up using
fuseki 4.3.2

I ran both:


If you ran them at the same time, you may have triggered the problem that
was fixed in 4.6.0.



$ curl -XPOST http://localhost:3030/$/backup/ds
$ curl -XPOST http://localhost:3030/$/compact/ds

Both of these commands executed for while, filling up disk space, and

then

suddently stopped:

Eventually, I ran:

$ curl -XGET http://localhost:3030/$/status

and for both the compact and backup command, I received:

   "success": false (as seen in the example below)

[ {
  "finished" : "2014-05-28T13:54:13.608+01:00" ,
  "started" : "2014-05-28T13:54:03.607+01:00" ,


2014?


  "task" : "backup" ,
  "taskId" : "1" ,
  "success" : false
}
]


As I couldn't find any other message to help me diagnose the issue, I
stopped the running fuseki instance and tried to use the tdb2.tdbackup
command.

For this I used apache-jena-4.9.0 and I ran the following command

$ tdb2.tdbbackup --loc build

This command ran for a while, and I could see that it was writing to the
disk, but then it suddenly failed and gave me the following error

message.



...

*Caused by: org.apache.thrift.protocol.TProtocolException: Unrecognized
type 0*

...



(I am assuming that this error is the same reason the "compact" command
wasn't working.)


The problem would have happened on the failed compact; it just manifests
itself later on read.

(there is another way to cause the same problem - if some other process
touches database files)


I'm not really sure what's gone wrong. I've done the fuseki compact

command

several times without a problem.

Likewise, the Fuseki http server continues to be running well. It is
responding to all SPARQL GET requests as usual.

But as the database is growing (currently at 70G), and I need to be able

to

both back it up and compact it as it grows.

I would be most grateful for assistance or help diagnosing the issue.
Please let me know if I can provide more information.


It's not looking good for the database if /$/backup is failing. That's a
very simple use of the database.

You may be able to extract data using SPARQL.

Some data will be in the backup file (the tail of the file may be
mangled but it's compressed n-quads so easy to text edit).

  Andy



Sincerely,

Jeff








Re: OOM Killed

2023-07-14 Thread Andy Seaborne

Hi Laura,

It hadn't occurred to me that the GC choice might be involved.
Also, I thought G1 was the default GC but it seems at java11 it isn't 
that simple. It's build dependent.


I use the Ubuntu build of OpenJDK.

Java11 has Shenandoah
Java17 has G1 -- and I think java21 will be G1.

Could you try Java17?

And Jena will be moving to require java17. The project supports 2 LTS 
and Java21 in September is LTS.


Andy

On 14/07/2023 08:54, Laura Morales wrote:

Have you tried different garbage collectors?


WOAH I didn't even consider that before you mentioned it! I did this

 JVM_ARGS="-XX:+UseSerialGC -Xmx4G" ./fuseki-server ...

and RAM usage of the java process peaked at 12GB

 $ cat /proc/108344/status | grep VmHWM
 VmHWM: 11916368 kB

Unfortunately I'm not at all familiar with Java garbage collectors. I don't 
understand why this option would use 1/3 less RAM than the default GC.
What other options are available for a more aggressive GC? I'm more interested 
in reducing RAM usage than raw query performance.


Re: Dataset management API

2023-07-13 Thread Andy Seaborne




On 13/07/2023 21:09, Martynas Jusevičius wrote:

Andy,

Where are the dataset definitions created through the API persisted?


run/configuration


Are they merged with the datasets defined in the config file, or how
does it work?


--config and run/configuration contribute services. Avoid name clashes.

Andy



Martynas

On Sun, 2 Jul 2023 at 19.03, Andy Seaborne  wrote:




On 02/07/2023 13:23, Martynas Jusevičius wrote:

Hi,

Can I see an example of the data that needs to be POSTed to /$/datasets

in

order to create a new dataset+service?

The API is documented here but the data example is missing:


https://jena.apache.org/documentation/fuseki2/fuseki-server-protocol.html#adding-a-dataset-and-its-services


I hope it’s the same data that is used in the config file?


the service part - or parameters dbType and dbname

ActionDatasets.java




https://jena.apache.org/documentation/fuseki2/fuseki-configuration.html#defining-the-service-name-and-endpoints-available


Are there any practical limits on the number of datasets/services?


No.

Each database takes up some memory which is more than the management
information of a configuration.

  Andy



Thanks.

Martynas







Re: OOM Killed

2023-07-13 Thread Andy Seaborne
eatest hardware, but I think my 
database is very small and I feel like Fuseki should not be using 16GB RAM when 
running a lot of simple queries in series (not in parallel).
One thing that I want to try, but so far haven't, is to restart Fuseki halfway 
through the job.




Sent: Monday, July 10, 2023 at 1:18 PM
From: "Andy Seaborne" 
To: users@jena.apache.org
Subject: Re: OOM Killed

Laura, Dave,

This doesn't sound like the same issue but let's see.

Dave - your situation isn't under high load is it?

- Is it in a container? If so:
Is it the container being killed OOM or
  Java throwing an OOM exception?
Much RAM does the container get? How many threads?

- If not a container, how many CPU Threads are there? How many cores?

- Which form of Fuseki are you using?

what does
java -XX:+PrintFlagsFinal -version \
 | grep -i 'M..HeapSize'

say?

How are you sending the queries to the server?

On 09/07/2023 20:33, Laura Morales wrote:

I'm running a job that is submitting a lot of queries to a Fuseki server, in 
parallel. My problem is that Fuseki is OOM-killed and I don't know how to fix 
this. Some details:

- Fuseki is queried as fast as possible. Queries take around 50-100ms to 
complete so I think it's serving 10s of queries each second


Are all the queries about the same amount of work or are some going to
cause significantly more memory use?

It is quite possible to send queries faster than the server can process
them - there is little point sending in parallel more than there are
real CPU threads to service them.

They will interfere and the machine can end up going slower (in terms of
queries per second).

I don't know exactly the impact on the GC but I think the JVM delays
minor GC's when very busy but that pushes it to do major ones earlier.

A thing to try is to use less parallelism.


- Fuseki 4.8. OS is Debian 12 (minimal installation with only OS, Fuseki, no 
desktop environments, uses only ~100MB of RAM)
- all the queries are read queries. No updates, inserts, or other write queries
- all the queries are over HTTP to the Fuseki endpoint
- database is TDB2 (created with tdb2.tdbloader)
- database contains around 2.5M triples
- the machine has 8GB RAM. I've tried on another PC with 16GB and it completes 
the job. On 8GB though, it won't
- with -Xmx6G it's killed earlier. With -Xmx2G it's killed later. Either way 
it's always killed.


Is it getting OOM at random or do certain queries tend to push it over
the edge?

Is it that the machine (container) has 8G RAM and there is no -Xmx setting?
In that case, the default setting applies, which is 25% of RAM.

A heap dump to know where the memory is going would be useful.


Is there anything that I can tweak to avoid Fuseki getting killed? Something that isn't 
"just buy more RAM".
Thank you




Re: error on tdb2.tdbbackup: "Caused by: org.apache.thrift.protocol.TProtocolException: Unrecognized type 0"

2023-07-13 Thread Andy Seaborne

Hi Jeff,

There were fixes to compaction in 4.6.0.

On 12/07/2023 23:53, Jeffrey C. Witt wrote:

Dear List,

I ran into an unusual error today when I tried to backup (and also compact)
my TDB instance.

I first encountered the error when trying to compact and backup up using
fuseki 4.3.2

I ran both:


If you ran them at the same time, you may have triggered the problem that 
was fixed in 4.6.0.




$ curl -XPOST http://localhost:3030/$/backup/ds
$ curl -XPOST http://localhost:3030/$/compact/ds

Both of these commands executed for while, filling up disk space, and then
suddently stopped:

Eventually, I ran:

$ curl -XGET http://localhost:3030/$/status

and for both the compact and backup command, I received:

  "success": false (as seen in the example below)

[ {
 "finished" : "2014-05-28T13:54:13.608+01:00" ,
 "started" : "2014-05-28T13:54:03.607+01:00" ,


2014?


 "task" : "backup" ,
 "taskId" : "1" ,
 "success" : false
   }
]


As I couldn't find any other message to help me diagnose the issue, I
stopped the running fuseki instance and tried to use the tdb2.tdbackup
command.

For this I used apache-jena-4.9.0 and I ran the following command

$ tdb2.tdbbackup --loc build

This command ran for a while, and I could see that it was writing to the
disk, but then it suddenly failed and gave me the following error message.


...

*Caused by: org.apache.thrift.protocol.TProtocolException: Unrecognized
type 0*

...



(I am assuming that this error is the same reason the "compact" command
wasn't working.)


The problem would have happened on the failed compact; it just manifests 
itself later on read.


(there is another way to cause the same problem - if some other process 
touches database files)



I'm not really sure what's gone wrong. I've done the fuseki compact command
several times without a problem.

Likewise, the Fuseki http server continues to be running well. It is
responding to all SPARQL GET requests as usual.

But as the database is growing (currently at 70G), and I need to be able to
both back it up and compact it as it grows.

I would be most grateful for assistance or help diagnosing the issue.
Please let me know if I can provide more information.


It's not looking good for the database if /$/backup is failing. That's a 
very simple use of the database.


You may be able to extract data using SPARQL.

Some data will be in the backup file (the tail of the file may be 
mangled but it's compressed n-quads so easy to text edit).
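
A sketch of pulling data out over SPARQL while the server is still answering
queries (dataset URL and output file are illustrative; the whole default graph
is materialised in memory here, so for a database this size you would pull it
out in slices instead):

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdfconnection.RDFConnection;
    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFDataMgr;

    try (RDFConnection conn = RDFConnection.connect("http://localhost:3030/ds");
         OutputStream out = new FileOutputStream("rescued.ttl")) {
        Model data = conn.queryConstruct("CONSTRUCT WHERE { ?s ?p ?o }");
        RDFDataMgr.write(out, data, Lang.TURTLE);
    }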


Andy



Sincerely,

Jeff



CVE-2023-32200: Apache Jena: Exposure of execution in script engine expressions.

2023-07-11 Thread Andy Seaborne
Severity: important

Affected versions:

- Apache Jena 3.7.0 through 4.8.0

Description:

There is insufficient restrictions of called script functions in Apache Jena
 versions 4.8.0 and earlier. It allows a 
remote user to execute javascript via a SPARQL query.
This issue affects Apache Jena: from 3.7.0 through 4.8.0.

Credit:

s3gundo of Alibaba (reporter)

References:

https://www.cve.org/CVERecord?id=CVE-2023-22665
https://jena.apache.org/
https://www.cve.org/CVERecord?id=CVE-2023-32200



Re: OOM Killed

2023-07-10 Thread Andy Seaborne




On 10/07/2023 12:18, Andy Seaborne wrote:

Laura, Dave,

This doesn't sound like the same issue but let's see.



Sorry for the confusion - these questions are for Laura.



- Is it in a container? If so:
   Is it the container being killed OOM or
     Java throwing an OOM exception?
   Much RAM does the container get? How many threads?

- If not a container, how many CPU Threads are there? How many cores?

- Which form of Fuseki are you using?

what does
   java -XX:+PrintFlagsFinal -version \
    | grep -i 'M..HeapSize'

say?

How are you sending the queries to the server?

On 09/07/2023 20:33, Laura Morales wrote:
I'm running a job that is submitting a lot of queries to a Fuseki 
server, in parallel. My problem is that Fuseki is OOM-killed and I 
don't know how to fix this. Some details:


- Fuseki is queried as fast as possible. Queries take around 50-100ms 
to complete so I think it's serving 10s of queries each second


Are all the queries about the same amount of work or are some going to 
cause significantly more memory use?


It is quite possible to send queries faster than the server can process 
them - there is little point sending in parallel more than there are 
real CPU threads to service them.


They will interfere and the machine can end up going slower (in terms of 
queries per second).


I don't know exactly the impact on the GC but I think the JVM delays 
minor GC's when very busy but that pushes it to do major ones earlier.


A thing to try is to use less parallelism.
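
If useful, a sketch of capping client-side parallelism with a fixed-size pool
(the endpoint URL and the query list are stand-ins for the real job):

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.jena.rdfconnection.RDFConnection;

    List<String> queries = List.of("SELECT * { ?s ?p ?o } LIMIT 10");  // stand-in for the real query list
    ExecutorService pool = Executors.newFixedThreadPool(4);   // roughly the server's core count
    for (String q : queries) {
        pool.submit(() -> {
            try (RDFConnection conn = RDFConnection.connect("http://localhost:3030/ds")) {
                conn.queryResultSet(q, rs -> rs.forEachRemaining(row -> { /* consume the row */ }));
            }
        });
    }
    pool.shutdown();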

- Fuseki 4.8. OS is Debian 12 (minimal installation with only OS, 
Fuseki, no desktop environments, uses only ~100MB of RAM)
- all the queries are read queries. No updates, inserts, or other 
write queries

- all the queries are over HTTP to the Fuseki endpoint
- database is TDB2 (created with tdb2.tdbloader)
- database contains around 2.5M triples
- the machine has 8GB RAM. I've tried on another PC with 16GB and it 
completes the job. On 8GB though, it won't
- with -Xmx6G it's killed earlier. With -Xmx2G it's killed later. 
Either way it's always killed.


Is it getting OOM at random or do certain queries tend to push it over 
the edge?


Is it that the machine (container) has 8G RAM and there is no -Xmx setting? 
In that case, the default setting applies, which is 25% of RAM.
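
A quick way to confirm what the JVM actually settled on, from inside the same
JVM (a sketch):

    // Prints the effective max heap, e.g. to confirm whether the 25%-of-RAM default applies
    long maxHeap = Runtime.getRuntime().maxMemory();
    System.out.printf("Max heap: %.1f MiB%n", maxHeap / (1024.0 * 1024.0));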


A heap dump to know where the memory is going would be useful.

Is there anything that I can tweak to avoid Fuseki getting killed? 
Something that isn't "just buy more RAM".

Thank you

