Re: Java 21 support for Jena Fuseki 5.0.0

2024-04-24 Thread Rob @ DNR
Java versions are generally forwards compatible, so Fuseki builds targeting Java 17
should run fine on Java 21, unless any of our dependencies have some previously
unreported issues with Java 21.

If you do find any problems then please file bug reports as appropriate.

Thanks,

Rob

From: Balduin Landolt 
Date: Wednesday, 24 April 2024 at 09:46
To: users@jena.apache.org 
Subject: Java 21 support for Jena Fuseki 5.0.0
Hi list,

me again... Does Jena Fuseki 5.0.0 support Java 21?
On https://jena.apache.org/download/ all I can see is "Jena5 requires Java
17".

Best,
Balduin


Re: Requesting advice on Fuseki memory settings

2024-03-21 Thread Rob @ DNR
Gaspar

This probably relates to https://access.redhat.com/solutions/2316

Deleting a file removes it from the file table but doesn’t immediately free the
space if a process is still accessing those files.  That could be something
else inside the container or, in a containerised environment where the disk
space is mounted, potentially host processes on the K8S node that
are monitoring the storage.

There are some suggested debugging steps in the RedHat article about ways to
figure out which processes might still be holding onto the old database files.
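
For example (an untested sketch; adjust the path to your database volume),
running something like the following inside the container should list
deleted-but-still-open files and the processes holding them:

    lsof +L1 | grep /path/to/databases

Here +L1 restricts the listing to files with a link count below one, i.e. files
that have been deleted but are still held open by some process.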

Rob

From: Gaspar Bartalus 
Date: Wednesday, 20 March 2024 at 11:41
To: users@jena.apache.org 
Subject: Re: Requesting advice on Fuseki memory settings
Hi Andy

On Sat, Mar 16, 2024 at 8:58 PM Andy Seaborne  wrote:

>
>
> On 12/03/2024 13:17, Gaspar Bartalus wrote:
> > On Mon, Mar 11, 2024 at 6:28 PM Andy Seaborne  wrote:
> >>
> >> On 11/03/2024 14:35, Gaspar Bartalus wrote:
> >>> Hi Andy,
> >>>
> >>> On Fri, Mar 8, 2024 at 4:41 PM Andy Seaborne  wrote:
> >>>
> 
>  On 08/03/2024 10:40, Gaspar Bartalus wrote:
> > Hi,
> >
> > Thanks for the responses.
> >
> > We were actually curious if you'd have some explanation for the
> > linear increase in the storage, and why we are seeing differences
> >> between
> > the actual size of our dataset and the size it uses on disk. (Changes
> > between `df -h` and `du -lh`)?
>  Linear increase between compactions or across compactions? The latter
>  sounds like the previous version hasn't been deleted.
> 
> >>> Across compactions, increasing linearly over several days, with
> >> compactions
> >>> running every day. The compaction is used with the "deleteOld"
> parameter,
> >>> and there is only one Data- folder in the volume, so I assume
> compaction
> >>> itself works as expected.
>
> >> Strange - I can't explain that. Could you check that there is only one
> >> Data- directory inside the database directory?
> >>
> > Yes, there is surely just one Data- folder in the database directory.
> >
> >> What's the disk storage setup? e.g filesystem type.
> >>
> > We have an Azure disk of type Standard SSD LRS with a filesystem of type
> > Ext4.
>
> Hi Gaspar,
>
> I still can't explain what you're seeing, I'm afraid.
>
> Can we get some more details?
>
> When the server has Data-N -- how big (as reported by 'du -sh') is that
> directory and how big is the whole directory for the database. They
> should be nearly equal.


> When a compaction is done, and the server is at Data-(N+1), what are the
> sizes of Data-(N+1) and the database directory?
>

What we see with respect to compaction is usually the following:
- We start with the Data-N folder of ~210MB
- After compaction we have a Data-(N+1) folder of size ~185MB, the old
Data-N being deleted.
- The sizes of the database directory and the Data-* directory are equal.

However, when we check with df -h we sometimes see that volume usage is not
dropping but, on the contrary, goes up by ~140MB after each compaction.

>
> Does stop/starting the server change those numbers?
>

Yes, then we start fresh where du -sh and df -h return the same numbers.

>
>  Andy
>


Re: [EXTERNAL] Re: Query Performance Degrade With Sorting In Subquery

2024-03-19 Thread Rob @ DNR
You haven’t specified how your data is stored, but assuming you are using Jena’s
TDB/TDB2 then the triples/quads themselves are already indexed for efficient
access.  It also inlines some value types, which speeds up some comparisons and
filters, including those used in simple ORDER BY expressions as in your example.

This assumes that the objects of your relations:hasUserCount triples are properly
typed as xsd:integer or another well-known XSD numeric type; if not, Jena is
forced to fall back to simpler lexical string sorting, which can be more
expensive.
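
For example (illustrative statements, with xsd: being the usual
http://www.w3.org/2001/XMLSchema# prefix), the first form below can be inlined
and compared numerically, while the second is a plain literal that only sorts
lexically:

    :label1 relations:hasUserCount "42"^^xsd:integer .
    :label2 relations:hasUserCount "42" .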

However, there is no indexing available for sorting because SPARQL allows for 
arbitrarily complex sort expressions, and the inputs to those expressions may 
themselves be dynamically computed values that don’t exist in the underlying 
dataset directly.

Rob

From: Chirag Ratra 
Date: Tuesday, 19 March 2024 at 10:39
To: users@jena.apache.org , Andy Seaborne 
, dcchabg...@gmail.com 
Subject: Re: [EXTERNAL] Re: Query Performance Degrade With Sorting In Subquery
Is there any way to create an index or something?

On Tue, Mar 19, 2024 at 3:46 PM Rob @ DNR  wrote:

> This is due to Jena’s lazy evaluation in its query engine.
>
> When you include a LIMIT clause on its own Jena only needs to find the first
> N results (10 in your example), at which point it can abort any further
> processing and return results.  In this case evaluation is lazy.
>
> When you include LIMIT and ORDER BY clauses Jena has to find all possible
> results, sort them, and then return only the first N results.  In this case
> full evaluation is required.
>
> One possible approach might be to split this into multiple queries, i.e. do one
> query to get your main set of results, and then separately issue the
> related item sub-queries with concrete values substituted in for your
> ?concept and ?titleSkosXl values. While Jena will still need to do full
> evaluation, injecting a concrete value will constrain the query evaluation
> further.
>
> Hope this helps,
>
> Rob
>
> From: Chirag Ratra 
> Date: Tuesday, 19 March 2024 at 07:46
> To: users@jena.apache.org 
> Subject: Query Performance Degrade With Sorting In Subquery
> Hi,
>
> Facing a big performance degradation while using a sort query in a subquery.
> If I run the query without sorting, the response of my query is around 200 ms,
> but when I use the ORDER BY query, performance comes out to around 4-5
> seconds.
>
> Here is my query :
>
> PREFIX text: <http://jena.apache.org/text#>
> PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
> PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
> PREFIX relations: <https://cxdata.bold.com/ontologies/myDomain#>
>
> SELECT ?concept ?titleSkosxl ?title ?languageCode (GROUP_CONCAT(DISTINCT
> ?relatedTitle; separator=", ") AS ?relatedTitles) (GROUP_CONCAT(DISTINCT
> ?alternate; separator=", ") AS ?alternates)
> WHERE
> {
>   (?titleSkosxl ?score) text:query ('cashier').
>
> ?concept skosxl:prefLabel ?titleSkosxl.
>   ?titleSkosxl skosxl:literalForm ?title.
>   ?titleSkosxl relations:usedInLocale ?controlledList.
>   ?controlledList relations:languageMarketCode ?languageCode
> FILTER(?languageCode = 'en-US').
>
>
> #  get alternate title
> OPTIONAL
>   {
> Select ?alternate  {
> ?concept skosxl:altLabel ?alternateSkosxl.
> ?alternateSkosxl skosxl:literalForm ?alternate;
>   relations:hasUserCount ?alternateUserCount.
> }
> ORDER BY DESC (?alternateUserCount) LIMIT 10
> }
>
> #  get related titles
>   OPTIONAL
>   {
>   Select ?relatedTitle
>   {
> ?titleSkosxl relations:isRelatedTo ?relatedSkosxl.
> ?relatedSkosxl skosxl:literalForm ?relatedTitle;
> relations:hasUserCount ?relatedUserCount.
>   }
> ORDER BY DESC (?relatedUserCount) LIMIT 10
>}
> }
> GROUP BY ?concept ?titleSkosxl ?title ?languageCode ?alternateJobTitle
> ?notation
> ORDER BY DESC(?jobtitleWeight) DESC(?score)
> LIMIT 10
>
> The sorting clauses given cause huge performance degradation:
> ORDER BY DESC (?alternateUserCount) AND ORDER BY DESC (?relatedUserCount)
>
> How can this be improved? This sorting will be used in each and every query
> in my application.
>

Re: Query Performance Degrade With Sorting In Subquery

2024-03-19 Thread Rob @ DNR
This is due to Jena’s lazy evaluation in its query engine.

When you include a LIMIT clause on its own Jena only needs to find the first N
results (10 in your example), at which point it can abort any further processing
and return results.  In this case evaluation is lazy.

When you include LIMIT and ORDER BY clauses Jena has to find all possible 
results, sort them, and then return only the first N results.  In this case 
full evaluation is required.

One possible approach might be to split this into multiple queries, i.e. do one
query to get your main set of results, and then separately issue the related item
sub-queries with concrete values substituted in for your ?concept and
?titleSkosXl values. While Jena will still need to do full evaluation,
injecting a concrete value will constrain the query evaluation further.
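
A rough sketch of that substitution approach in Java (untested;
ParameterizedSparqlString is Jena’s helper for injecting concrete values, while
the class name, the example concept URI and the surrounding model variable are
illustrative):

import org.apache.jena.query.ParameterizedSparqlString;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.rdf.model.Model;

public class AlternateTitles {
    // Run the alternate-titles sub-query for one concrete concept URI
    static void queryAlternates(Model model, String conceptUri) {
        ParameterizedSparqlString pss = new ParameterizedSparqlString(
            "PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>\n" +
            "PREFIX relations: <https://cxdata.bold.com/ontologies/myDomain#>\n" +
            "SELECT ?alternate WHERE {\n" +
            "  ?concept skosxl:altLabel ?alternateSkosxl .\n" +
            "  ?alternateSkosxl skosxl:literalForm ?alternate ;\n" +
            "                   relations:hasUserCount ?alternateUserCount .\n" +
            "} ORDER BY DESC(?alternateUserCount) LIMIT 10");
        // Replace ?concept with the concrete URI obtained from the main query's results
        pss.setIri("concept", conceptUri);
        try (QueryExecution qe = QueryExecutionFactory.create(pss.asQuery(), model)) {
            qe.execSelect().forEachRemaining(row -> System.out.println(row.get("alternate")));
        }
    }
}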

Hope this helps,

Rob

From: Chirag Ratra 
Date: Tuesday, 19 March 2024 at 07:46
To: users@jena.apache.org 
Subject: Query Performance Degrade With Sorting In Subquery
Hi,

Facing a big performance degradation while using a sort query in a subquery.
If I run the query without sorting, the response of my query is around 200 ms,
but when I use the ORDER BY query, performance comes out to around 4-5
seconds.

Here is my query :

PREFIX text: <http://jena.apache.org/text#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
PREFIX relations: <https://cxdata.bold.com/ontologies/myDomain#>

SELECT ?concept ?titleSkosxl ?title ?languageCode (GROUP_CONCAT(DISTINCT
?relatedTitle; separator=", ") AS ?relatedTitles) (GROUP_CONCAT(DISTINCT
?alternate; separator=", ") AS ?alternates)
WHERE
{
  (?titleSkosxl ?score) text:query ('cashier').

?concept skosxl:prefLabel ?titleSkosxl.
  ?titleSkosxl skosxl:literalForm ?title.
  ?titleSkosxl relations:usedInLocale ?controlledList.
  ?controlledList relations:languageMarketCode ?languageCode
FILTER(?languageCode = 'en-US').


#  get alternate title
OPTIONAL
  {
Select ?alternate  {
?concept skosxl:altLabel ?alternateSkosxl.
?alternateSkosxl skosxl:literalForm ?alternate;
  relations:hasUserCount ?alternateUserCount.
}
ORDER BY DESC (?alternateUserCount) LIMIT 10
}

#  get related titles
  OPTIONAL
  {
  Select ?relatedTitle
  {
?titleSkosxl relations:isRelatedTo ?relatedSkosxl.
?relatedSkosxl skosxl:literalForm ?relatedTitle;
relations:hasUserCount ?relatedUserCount.
  }
ORDER BY DESC (?relatedUserCount) LIMIT 10
   }
}
GROUP BY ?concept ?titleSkosxl ?title ?languageCode ?alternateJobTitle
?notation
ORDER BY DESC(?jobtitleWeight) DESC(?score)
LIMIT 10

The sorting clauses given cause huge performance degradation:
ORDER BY DESC (?alternateUserCount) AND ORDER BY DESC (?relatedUserCount)

How can this be improved? This sorting will be used in each and every query
in my application.



Re: Problems when querying the SPARQL with Jena

2024-03-11 Thread Rob @ DNR
This looks like some system-wide initialisation isn’t happening correctly in
your runtime environment.

You can first try adding a JenaSystem.init() call as the first line of your
main() method, e.g. as sketched below.
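
For example, a minimal sketch (in recent Jena releases JenaSystem lives in
org.apache.jena.sys; adjust the import if your version differs):

import org.apache.jena.sys.JenaSystem;

public class App {
    public static void main(String[] args) {
        // Force Jena's system-wide initialisation before any other Jena call
        JenaSystem.init();
        // ... the rest of your Jena code ...
    }
}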

Also, if you are building a fat/uber JAR via your build toolchain please be 
aware that Jena uses ServiceLoader for auto-discovery of system initialisation. 
 Please refer to https://jena.apache.org/documentation/notes/jena-repack.html 
for notes on how to ensure that doing so doesn’t break things like this.

Rob

From: Anna P 
Date: Monday, 11 March 2024 at 13:45
To: users@jena.apache.org 
Subject: Problems when querying the SPARQL with Jena
Dear Jena support team,

Currently I just started to work on a SPARQL project using Jena and I could
not get a solution when I query a model.
I imported a turtle file and ran a simple query, and the snippet code is
shown below. However, I got the error.

public class App {
    public static void main(String[] args) {
        try {
            Model model = RDFDataMgr.loadModel("data.ttl", Lang.TURTLE);
            RDFDataMgr.write(System.out, model, Lang.TURTLE);
            String queryString = "SELECT * { ?s ?p ?o }";
            Query query = QueryFactory.create(queryString);
            QueryExecution qe = QueryExecutionFactory.create(query, model);
            ResultSet results = qe.execSelect();
            ResultSetFormatter.out(System.out, results, query);
            qe.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Here is the error message:

org.apache.jena.riot.RiotException: Not registered as a SPARQL result set
output syntax: Lang:SPARQL-Results-JSON
at
org.apache.jena.sparql.resultset.ResultsWriter.write(ResultsWriter.java:179)
at
org.apache.jena.sparql.resultset.ResultsWriter.write(ResultsWriter.java:156)
at
org.apache.jena.sparql.resultset.ResultsWriter.write(ResultsWriter.java:149)
at
org.apache.jena.sparql.resultset.ResultsWriter$Builder.write(ResultsWriter.java:96)
at
org.apache.jena.query.ResultSetFormatter.output(ResultSetFormatter.java:308)
at
org.apache.jena.query.ResultSetFormatter.outputAsJSON(ResultSetFormatter.java:516)
at de.unistuttgart.ki.esparql.App.main(App.java:46)


Thank you for your time and help!

Best regards,

Pan


Re: jena-fuseki UI in podman execution

2024-02-08 Thread Rob @ DNR
Hi

This list does not permit attachments, so we can’t see your screenshots. Can you
upload them to some public image hosting somewhere and link to them?

Thanks,

Rob

From: jaa...@kolumbus.fi 
Date: Thursday, 8 February 2024 at 08:48
To: users@jena.apache.org 
Subject: jena-fuseki UI in podman execution
Hi, I've been running jena-fuseki with docker:

docker run -p 3030:3030 -e ADMIN_PASSWORD=pw123 stain/jena-fuseki

and rootless podman:

podman run -p 3030:3030 -e ADMIN_PASSWORD=pw123 docker.io/stain/jena-fuseki

when executing the same version 4.8.0 of jena-fuseki with podman, the UI looks
totally different from the UI of the instance executed with docker.

See attachment for the UI of the podman execution.

What can cause this problem?

Br, Jaana M






Re: question about FROM keyword

2024-02-05 Thread Rob @ DNR
So, there’s a couple of things happening here.

Firstly, Jena’s SPARQL engine always treats FROM (and FROM NAMED) as referring 
to graphs in the local dataset.  So, it doesn’t matter that the URL in your 
FROM is a valid RDF resource on the web, Jena won’t try and load that by 
default, it just looks for a graph with that URI in the local dataset.

Nothing in the SPARQL specifications requires that these URLs be treated 
otherwise.  Some implementations choose to resolve these URIs from the web but 
that isn’t required by the standard, and from a security standpoint isn’t a 
good idea.

Secondly, for the arq command line tool the local dataset is usually an implicit
empty dataset if you don’t supply one.  Except, as it turns out, when you supply
a FROM/FROM NAMED, in which case it tries to build one from the inputs it has.
In this case that’s only your query file, which isn’t valid when treated as an
RDF dataset, thus you get the big nasty stack trace you reported.  (This
specifically may be a bug in the arq tool.)

You can avoid this second problem by supplying an empty data file e.g.

 arq --query query.rq --data empty.ttl

But that will only serve to highlight the first issue, that Jena only treats 
FROM/FROM NAMED as references to graphs in the local dataset, and you’ll get an 
empty result from your query.

You are better off downloading the RDF data you want to query locally and then 
running arq and supplying both a query file and a data file.
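
For example (an untested sketch; substitute the real data URL and filenames),
download the data first and then query it locally, dropping the FROM clause
since the data is now supplied directly:

    curl -L -o data.ttl http://example.org/data.ttl
    arq --query ex070mod2.rq --data data.ttl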

Hope this helps,

Rob

From: Zlatareva, Neli (Computer Science) 
Date: Monday, 5 February 2024 at 01:40
To: users@jena.apache.org 
Subject: question about FROM keyword
Hi there, I am trying the following arq query from command window
(works fine if I am getting the file locally)

PREFIX ab: 
>
SELECT ?last ?first ?courseName
FROM 
WHERE
{
  ?s ab:firstName ?first ;
 ab:lastName ?last ;
 ab:takingCourse ?course .
  ?course ab:courseTitle ?courseName .
}

I am getting the following error

D:\neli\cs575Spring24>arq --query ex070mod2.rq
ERROR StatusLogger Reconfiguration failed: No configuration found for 
'73d16e93' at 'null' in 'null'
org.apache.jena.riot.RiotException: Failed to determine the content type: 
(URI=file:///D:/neli/cs575Spring24/ex070mod2.rq : stream=text/plain)
at org.apache.jena.riot.RDFParser.parseURI(RDFParser.java:380)
at org.apache.jena.riot.RDFParser.parse(RDFParser.java:360)
at 
org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:570)
at org.apache.jena.riot.RDFDataMgr.parseFromURI(RDFDataMgr.java:737)
at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:193)
at 
org.apache.jena.sparql.util.DatasetUtils.addInGraphsWorker(DatasetUtils.java:200)
at 
org.apache.jena.sparql.util.DatasetUtils.lambda$addInGraphs$0(DatasetUtils.java:181)
at org.apache.jena.system.Txn.exec(Txn.java:77)
at org.apache.jena.system.Txn.executeWrite(Txn.java:125)
at 
org.apache.jena.sparql.util.DatasetUtils.addInGraphs(DatasetUtils.java:181)
at 
org.apache.jena.sparql.util.DatasetUtils.createDatasetGraph(DatasetUtils.java:153)
at 
org.apache.jena.sparql.util.DatasetUtils.createDatasetGraph(DatasetUtils.java:142)
at 
org.apache.jena.sparql.engine.QueryEngineBase.prepareDataset(QueryEngineBase.java:82)
at 
org.apache.jena.sparql.engine.QueryEngineBase.(QueryEngineBase.java:58)
at 
org.apache.jena.sparql.engine.main.QueryEngineMain.(QueryEngineMain.java:45)
at 
org.apache.jena.sparql.engine.main.QueryEngineMain$QueryEngineMainFactory.create(QueryEngineMain.java:89)
at 
org.apache.jena.sparql.exec.QueryExecDataset.getPlan(QueryExecDataset.java:514)
at 
org.apache.jena.sparql.exec.QueryExecDataset.startQueryIterator(QueryExecDataset.java:455)
at 
org.apache.jena.sparql.exec.QueryExecDataset.execute(QueryExecDataset.java:170)
at 
org.apache.jena.sparql.exec.QueryExecDataset.select(QueryExecDataset.java:164)
at 
org.apache.jena.sparql.exec.QueryExecutionAdapter.execSelect(QueryExecutionAdapter.java:117)
at 
org.apache.jena.sparql.exec.QueryExecutionCompat.execSelect(QueryExecutionCompat.java:99)
at 
org.apache.jena.sparql.util.QueryExecUtils.doSelectQuery(QueryExecUtils.java:174)
at 
org.apache.jena.sparql.util.QueryExecUtils.executeQuery(QueryExecUtils.java:106)
at arq.query.lambda$queryExec$0(query.java:239)
at org.apache.jena.system.Txn.exec(Txn.java:77)
at org.apache.jena.system.Txn.executeRead(Txn.java:115)
at arq.query.queryExec(query.java:236)
at arq.query.exec(query.java:159)
at org.apache.jena.cmd.CmdMain.mainMethod(CmdMain.java:87)
at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:56)
at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:43)

Re: Problem running AtomGraph/fuseki-docker

2023-12-08 Thread Rob @ DNR
Re: command prompt closure

That’s standard Docker behaviour.  Unless you tell it otherwise it runs your 
container in the foreground attached to your terminal.  Also, you’ve specified 
--rm which tells Docker to remove the container as soon as it exits, if you 
want to inspect logs after the container exits don’t include this option.

You probably wanted to add -d/--detach to run the container in the background;
you can then follow the logs with the docker logs command or by viewing them in
the Docker Desktop interface.
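
For example (an untested sketch; the container name is illustrative):

    docker run -d --name fuseki -p 3030:3030 atomgraph/fuseki --mem /ds
    docker logs -f fuseki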

Rob

From: Steve Vestal 
Date: Wednesday, 6 December 2023 at 23:56
To: users@jena.apache.org 
Subject: Re: Problem running AtomGraph/fuseki-docker
I was using bash.  When I run it in command prompt, it works. Thanks!

Interestingly, when the command prompt is closed, the container is
removed from Docker Desktop.  Each new start creates a new container
with a new amusing name :-)

C:\Users\svestal>docker run --rm -p 3030:3030 atomgraph/fuseki --mem '/ds'
[2023-12-06 22:19:53] INFO  Server  :: Apache Jena Fuseki 4.6.1
[2023-12-06 22:19:53] INFO  Server  :: Database: in-memory
[2023-12-06 22:19:53] INFO  Server  :: Path = /'/ds'
[2023-12-06 22:19:53] INFO  Server  :: System
[2023-12-06 22:19:53] INFO  Server  ::   Memory: 2.0 GiB
[2023-12-06 22:19:53] INFO  Server  ::   Java:   17-ea
[2023-12-06 22:19:53] INFO  Server  ::   OS: Linux
5.15.133.1-microsoft-standard-WSL2 amd64
[2023-12-06 22:19:53] INFO  Server  ::   PID:1
[2023-12-06 22:19:53] INFO  Server  :: Start Fuseki (http=3030)

On 12/6/2023 2:12 PM, Martynas Jusevičius wrote:
> Hi Steve,
>
> This looks like Windows shell issue.
>
> For some reason /ds is resolved as a filepath where it shouldn’t.
>
> Can you try --mem '/ds' with quotes?
>
> I’m running Docker on WSL2 and never had this problem.
>
> Martynas
>
> On Wed, 6 Dec 2023 at 21.05, Steve Vestal  wrote:
>
>> I am running a VM with Microsoft Windows Server 2019 (64-bit). When I
>> try to stand up the docker server, I get
>>
>> $ docker run --rm -p 3030:3030 atomgraph/fuseki --mem /ds
>> String '/C:/Program Files/Git/ds' not valid as 'service'
>>
>> Suggestions?
>>
>>


Re: Query features info

2023-09-21 Thread Rob @ DNR
Hashim

I think what you want is probably --optimize=off and that should yield you the 
expected two BGPs e.g.


arq --optimize=off --query exampleQuery.sparql --explain


09:43:57 INFO  exec:: ALGEBRA
  (slice _ 100
    (distinct
      (project (?name ?birth ?death)
        (filter (< ?birth "1900-01-01")
          (leftjoin
            (bgp
              (triple ?person <http://dbpedia.org/ontology/birthPlace> <http://dbpedia.org/resource/Berlin>)
              (triple ?person <http://dbpedia.org/ontology/birthDate> ?birth)
              (triple ?person <http://xmlns.com/foaf/0.1/name> ?name)
            )
            (bgp (triple ?person <http://dbpedia.org/ontology/deathDate> ?death)))))))



As James noted, query engines are free to apply optimisations to the raw algebra
of the query provided that those optimisations preserve the semantics of the
query.  Jena’s ARQ query engine contains many of these that have been developed
over many years based on implementation experience, academic papers, applying
well-known query optimisation techniques, etc.

Note that turning optimisations off (whatever engine you are using) is rarely a
good idea.

You may want to experiment with the qparse command whose --print option allows
you to ask to see various forms of the query from the perspective of Jena’s ARQ
engine.  For example, the following shows the raw algebra and the optimised
algebra in the same output.

qparse --query exampleQuery.sparql --print=op --print=optquad



(prefix ((dbo: <http://dbpedia.org/ontology/>)
         (dbr: <http://dbpedia.org/resource/>)
         (foaf: <http://xmlns.com/foaf/0.1/>))
  (slice _ 100
    (distinct
      (project (?name ?birth ?death)
        (filter (< ?birth "1900-01-01")
          (leftjoin
            (bgp
              (triple ?person dbo:birthPlace dbr:Berlin)
              (triple ?person dbo:birthDate ?birth)
              (triple ?person foaf:name ?name)
            )
            (bgp (triple ?person dbo:deathDate ?death))))))))

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

(prefix ((dbo: <http://dbpedia.org/ontology/>)
         (dbr: <http://dbpedia.org/resource/>)
         (foaf: <http://xmlns.com/foaf/0.1/>))
  (slice _ 100
    (distinct
      (project (?name ?birth ?death)
        (conditional
          (sequence
            (filter (< ?birth "1900-01-01")
              (quadpattern
                (quad <urn:x-arq:DefaultGraphNode> ?person dbo:birthPlace dbr:Berlin)
                (quad <urn:x-arq:DefaultGraphNode> ?person dbo:birthDate ?birth)
              ))
            (quadpattern (quad <urn:x-arq:DefaultGraphNode> ?person foaf:name ?name)))
          (quadpattern (quad <urn:x-arq:DefaultGraphNode> ?person dbo:deathDate ?death)))))))



Hope this helps,



Rob

From: James Anderson 
Date: Wednesday, 20 September 2023 at 22:06
To: Hashim Khan 
Cc: users@jena.apache.org 
Subject: Re: Query features info
good evening;

if you want to reproduce those results, you will have to examine the parsed
syntax tree.
that should comprise just two bgps, as that is the immediate syntax.
if, on the other hand, you examine the results of a query planner, you are not
looking at a syntax tree, you are looking at the query processor's prospective
execution plan.
the execution model permits the transformations to which i alluded.
you will more likely get your desired representation by having jena emit an
sse, rather than an execution plan.

best regards, from berlin,

> On 20. Sep 2023, at 17:48, Hashim Khan  wrote:
>
> Thanks for the quick reply.
>
> To be precise, I want to clarify the table on page 7 of the attached paper. 
> Here, the No. of BGPs is 2, and also some more values. I want to extract all 
> the info using Jena. But I could not till now. About the LSQ, I will check 
> it, but I am following this paper and want to reproduce the results.
>
> Best Regards,
> Hashim
>
> On Tue, Sep 19, 2023 at 4:18 PM James Anderson 
>  wrote:
> good afternoon;
>
> you have to consider that a query processor is free to consolidate statement 
> patterns in a nominal bgp - which itself implicitly joins them, or separate 
> them in order to either apply a different join strategy or - as in this case, 
> to interleave an operation under the suspicion that it will reduce solution 
> set cardinality.
>
> best regards, from berlin,
>
> > On 19. Sep 2023, at 13:20, Hashim Khan  wrote:
> >
> > Hi,
> >
> > Having a look on this SPARQL query:
> > ---
> > prefix dbo:
> > prefix dbr:
> > prefix foaf:
> >
> > SELECT DISTINCT ?name ?birth ?death
> > WHERE { ?person dbo:birthPlace  dbr:Berlin .
> >?person dbo:birthDate ?birth .
> >?person foaf:name ?name .
> > OPTIONAL { ?person dbo:deathDate ?death . }
> > FILTER (?birth < "1900-01-01") .
> > }
> > LIMIT 100
> > -
> > Using Apache Jena ARQ 

Re: Jena hangs on deleted files

2023-09-12 Thread Rob @ DNR
Well, yes, there shouldn’t be, but that wasn’t what Andy suggested/asked.

Have you verified that nothing else is holding references to those files in any 
way e.g.

lsof | grep /path/to/your/db

And checked that only a single Java process is listed in the output?

We don’t know your deployment environment; it could be some mundane background
process (e.g. anti-virus, search indexer) running on your system, it could be a
bug in the particular JVM you are using, or something else entirely, but without
any more details we can only guess at possibilities.

Another long shot is that it could be a hardware issue: if you’re running the
database on an SSD it could be a driver optimisation that doesn’t actually
delete files until the holding process exits, to avoid unnecessary write
operations and prolong the life of the drive.

Rob

From: Mikael Pesonen 
Date: Monday, 11 September 2023 at 12:17
To: users@jena.apache.org 
Subject: Re: Jena hangs on deleted files
There should not be other processes accessing the files. When jena is
restarted, space from deleted files is released.

On 09/09/2023 18.56, Andy Seaborne wrote:
> This situation could be related to the other issues you've reported
> (corrupted node tables) if some other Linux process (not necessarily
> Java) is accessing the files.
>
> A process holding them open will stop them becoming recyclable by the OS.
>
> Andy
>
> On 08/09/2023 13:09, Mikael Pesonen wrote:
>> Just on a command line (dev system)
>>
>> /usr/bin/java -Xmx8G -jar fuseki-server.jar --update --port 3030
>> --config=../jena_config/fuseki_config.ttl
>>
>>
>> On 08/09/2023 11.47, Andy Seaborne wrote:
>>> In a container? As a VM?
>>>
>>> On 08/09/2023 07:36, Mikael Pesonen wrote:
 We are using Ubuntu.

 On Thu, 7 Sept 2023 at 16:33, Andy Seaborne  wrote:

> Are the database files on a MS Windows filesystem?
>
> There is a long-standing Java issue that memory mapped files on MS
> Windows do not get freed until the JVM exists.
>
> Various bugs in the OpenJDK bug database such as:
>
> https://bugs.openjdk.org/browse/JDK-4715154
>
>   Andy
>
> On 07/09/2023 13:06, Mikael Pesonen wrote:
>>
>> We used deleteOld param. The 50 gigs are ghost files that are
>> deleted
>> but not released, that's what I meant by hanging on deleted files.
>> Restarting jena releases them and now for example freed 50 gigs
>> of space.
>>
>> On 07/09/2023 15.02, Øyvind Gjesdal wrote:
>>> What does the content of the tdb2 folder look like?
>>>
>>> I think compact by default never deletes the old data, but you have
>>> parameters for making it delete the old content on completion.
>>>
>>> `--deleteOld` can be supplied to the tdb2.tdbcompact command
>>> line tool
>>> and
>>> `?deleteOld=true` can be supplied to the administration api when
>>> calling
>>> compact
>>>
> https://jena.apache.org/documentation/fuseki2/fuseki-server-protocol.html#compact
>
>>>
>>> You can also delete  the Data- that isn't the latest one in the
>>> database folder.
>>>
>>> Best regards,
>>> Øyvind
>>>
>>> On Thu, Sep 7, 2023 at 1:33 PM Mikael Pesonen
>>> 
>>> wrote:
>>>
 After a while 25 gigs of files on data folder becomes 80 gigs
 of disk
 usage because Jena (4.6.1) doen't release files. Same with
 compact. Is
 this fixed in newer versions?

>>
>

>>


Re: Mystery memory leak in fuseki

2023-09-01 Thread Rob @ DNR
Yes and No

The embedded server mode of operation for Fuseki is dependent on Jetty.  But 
the real core of Fuseki is actually just plain Java Servlets and Filter’s and 
Fuseki’s own dynamic dispatch code.

FWIW I am also a big fan of JAX-RS but moving to JAX-RS would probably be a 
much more substantiative rewrite.  This would need to be done carefully to 
support Fuseki’s dynamic configuration model but I think it is possible, not 
sure it’s in-scope for Jena 5 timeframe though

Rob

From: Martynas Jusevičius 
Date: Thursday, 31 August 2023 at 19:35
To: users@jena.apache.org 
Subject: Re: Mystery memory leak in fuseki
Does Fuseki have direct code dependency on Jetty? Or would it be possible
to try switching to a different servlet container such as Tomcat?

JAX-RS, which I’ve advocated here multiple times, provides such a
higher-level abstraction above servlets that would enable easy switching.

On Fri, 25 Aug 2023 at 16.18, Dave Reynolds 
wrote:

> On 25/08/2023 11:44, Andy Seaborne wrote:
> >
> >
> > On 03/07/2023 14:20, Dave Reynolds wrote:
> >> We have a very strange problem with recent fuseki versions when
> >> running (in docker containers) on small machines. Suspect a jetty
> >> issue but it's not clear.
> >
> >  From the threads here, it does seem to be Jetty related.
>
> Yes.
>
> We've followed up on Rob's suggestions for tuning the jetty settings so
> we can use a stock fuseki. On 4.9.0 if we switch off direct buffer using
> in jetty altogether the problem does seem to go away. The performance
> hit we see is small and barely above noise.
>
> We currently have a soak test of leaving direct buffers on but limiting
> max and retained levels, that looks promising but too early to be sure.
>
> > I haven't managed to reproduce the situation on my machine in any sort
> > of predictable way where I can look at what's going on.
>
> Understood. While we can reproduce some effects in desktop test set ups
> the only real test has been to leave configurations running for days at
> a time in the real dev setting with all it's monitoring and
> instrumentation. Which makes testing any changes very painful, let alone
> deeper investigations.
>
> > For Jena5, there will be a switch to a Jetty to use uses jakarta.*
> > packages. That's no more than a rename of imports. The migration
> > EE8->EE9 is only repackaging.  That's Jetty10->Jetty11.
> >
> > There is now Jetty12. It is a major re-architecture of Jetty including
> > it's network handling for better HTTP/2 and HTTP/3.
> >
> > If there has been some behaviour of Jetty involved in the memory growth,
> > it is quite unlikely to carried over to Jetty12.
> >
> > Jetty12 is not a simple switch of artifacts for Fuseki. APIs have
> > changed but it's a step that going to be needed sometime.
> >
> > If it does not turn out that Fuseki needs a major re-architecture, I
> > think that Jena5 should be based on Jetty12. So far, it looks doable.
>
> Sound promising. Agreed that jetty12 is enough of a new build it's
> unlikely to have the same behaviour.
>
> We've being testing some of our troublesome queries on 4.9.0 on java 11
> vs java 17 and see a 10-15% performance hit on java 17 (even after we
> take control of the GC by forcing both to use the old parallel GC
> instead of G1). No idea why, seems wrong! Makes us inclined to stick
> with java 11 and thus jena 4.x series as long as we can.
>
> Dave
>
>


Re: riot cmd convert RDF to JSON-LD framing

2023-08-14 Thread Rob @ DNR
Riot’s, and more generally Jena’s, configuration symbols are actually URIs
internally, so the --set option needs to receive the full URI for the symbol,
which I think should start with http://jena.apache.org/riot/jsonld#, not just
the Java constant names as they appear in the examples/API.

Also, I don’t believe that any of these context options expect to receive a
file; rather they expect to contain a chunk of JSON itself, so from the command
line you’d probably need something like the following:

$ export FRAME=$(cat frame.json)
$ riot --out JSONLD_FRAME_PRETTY --set "http://jena.apache.org/riot/jsonld#JSONLD_FRAME=$FRAME" input.ttl

NB – Completely untested, I don’t use JSON-LD myself at all so no guarantees 
any of this will work, but hopefully this at least points you in the right 
direction to make progress

Rob

From: Martin 
Date: Monday, 14 August 2023 at 12:45
To: users@jena.apache.org 
Subject: riot cmd convert RDF to JSON-LD framing
Hi,

I would like to convert RDF (on Turtle format) to JSON-LD and apply a
JSON-LD framing specification to it (*) -- and I would prefer to do
this with the command line tooling that ships with Jena.

I can transform my RDF to JSON-LD with the command

  $ riot --out=jsonld [file]

but I have not found a way to pass my context json file to the command.
Attempts like this fails or does not pick up the context file:

 $ riot --out=JSONLD_FRAME_PRETTY --set JSONLD_CONTEXT=[file] [file]

These attempts are motivated by
https://jena.apache.org/documentation/io/rdf-output.html#json-ld


Is there a way to pass a context file to riot, or otherwise achieve
what I want using Jena's command line tools? If not, what is my best
other option?

Thanks!

Martin

(*) Apologies if I am not using correct terminology for this.


Re: Mystery memory leak in fuseki

2023-07-11 Thread Rob @ DNR
Dave

Thanks for the further information.

Have you experimented with using Jetty 10 but providing more detailed 
configuration?  Fuseki supports providing detailed Jetty configuration if 
needed via the --jetty-config option

The following section looks relevant:

https://eclipse.dev/jetty/documentation/jetty-10/operations-guide/index.html#og-module-bytebufferpool

It looks like the default is that Jetty uses a heuristic to determine these 
values, sadly the heuristic in question is not detailed in that documentation.

Best guess from digging through their code is that the “heuristic” is this:

https://github.com/eclipse/jetty.project/blob/jetty-10.0.x/jetty-io/src/main/java/org/eclipse/jetty/io/AbstractByteBufferPool.java#L78-L84

i.e., ¼ of the configured max heap size.  This doesn’t necessarily align with
the exact sizes of process growth you see, but I note the documentation does
explicitly say that buffers can go beyond these limits; those will
just be GC’d rather than pooled for reuse.

Example byte buffer configuration at 
https://github.com/eclipse/jetty.project/blob/9a05c75ad28ebad4abbe624fa432664c59763747/jetty-server/src/main/config/etc/jetty-bytebufferpool.xml#L4

Any chance you could try customising this for your needs with stock Fuseki and 
see if this allows you to make the process size smaller and sufficiently 
predictable for your use case?
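
For example (an untested sketch; jetty-custom.xml is a file you would create,
using the linked jetty-bytebufferpool.xml as a starting point):

    fuseki-server --jetty-config=jetty-custom.xml --mem /ds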

Rob

From: Dave Reynolds 
Date: Tuesday, 11 July 2023 at 08:58
To: users@jena.apache.org 
Subject: Re: Mystery memory leak in fuseki
For interest[*] ...

This is what the core JVM metrics look like when transitioning from a
Jetty10 to a Jetty9.4 instance. You can see the direct buffer cycling up
to 500MB (which happens to be the max heap setting) on Jetty 10, nothing
on Jetty 9. The drop in Mapped buffers is just because TDB hadn't been
asked any queries yet.

https://www.dropbox.com/scl/fi/9afhrztbb36fvzqkuw996/fuseki-jetty10-jetty9-transition.png?rlkey=7fpj4x1pn5mjnf3jjwenmp65m=0

Here' the same metrics around the time of triggering a TDB backup. Shows
the mapped buffer use for TDB but no significant impact on heap etc.

https://www.dropbox.com/scl/fi/0s40vpizf94c4w3m2awna/fuseki-jetty10-backup.png?rlkey=ai31m6z58w0uex8zix8e9ctna=0

These are all on the same instance as the RES memory trace:

https://www.dropbox.com/scl/fi/c58nqkr2hi193a84btedg/fuseki-4.9.0-jetty-9.4.png?rlkey=b7osnj6k1oy1xskl4j25zz6o8=0

Dave

[*] I've been staring and metric graphs for so many days I may have a
distorted notion of what's interesting :)

On 11/07/2023 08:39, Dave Reynolds wrote:
> After a 10 hour test of 4.9.0 with Jetty 9.4 on java 17 in the
> production, containerized, environment then it is indeed very stable.
>
> Running at less that 6% of memory on 4GB machine compared to peaks of
> ~50% for versions with Jetty 10. RES shows as 240K with 35K shared
> (presume mostly libraries).
>
> Copy of trace is:
> https://www.dropbox.com/scl/fi/c58nqkr2hi193a84btedg/fuseki-4.9.0-jetty-9.4.png?rlkey=b7osnj6k1oy1xskl4j25zz6o8=0
>
> The high spikes on left of image are the prior run on with out of the
> box 4.7.0 on same JVM.
>
> The small spike at 06:00 is a dump so TDB was able to touch and scan all
> the (modest) data with very minor blip in resident size (as you'd hope).
> JVM stats show the mapped buffers for TDB jumping up but confirm heap is
> stable at < 60M, non-heap 60M.
>
> Dave
>
> On 10/07/2023 20:52, Dave Reynolds wrote:
>> Since this thread has got complex, I'm posting this update here at the
>> top level.
>>
>> Thanks to folks, especially Andy and Rob for suggestions and for
>> investigating.
>>
>> After a lot more testing at our end I believe we now have some
>> workarounds.
>>
>> First, at least on java 17, the process growth does seem to level out.
>> Despite what I just said to Rob, having just checked our soak tests, a
>> jena 4.7.0/java 17 test with 500MB max heap has lasted for 7 days.
>> Process size oscillates between 1.5GB and 2GB but hasn't gone above
>> that in a week. The oscillation is almost entirely the cycling of the
>> direct memory buffers used by Jetty. Empirically those cycle up to
>> something comparable to the set max heap size, at least for us.
>>
>> While this week long test was 4.7.0, based on earlier tests I suspect
>> 4.8.0 (and now 4.9.0) would also level out at least on a timescale of
>> days.
>>
>> The key has been setting the max heap low. At 2GB and even 1GB (the
>> default on a 4GB machine) we see higher peak levels of direct buffers
>> and overall process size grew to around 3GB at which point the
>> container is killed on the small machines. Though java 17 does seem to
>> be better behaved that java 11, so switching to that probably also
>> helped.
>>
>> Given the actual heap is low (50MB heap, 60MB non-heap) then needing
>> 2GB to run in feels high but is workable. So my previously suggested
>> rule of thumb that, in this low memory regime, allow 4x the max heap
>> size seems to work.
>>
>> Second, we're now 

Re: Mystery memory leak in fuseki

2023-07-10 Thread Rob @ DNR
Dave

Poked around a bit today but not sure I’ve reproduced anything as such or found 
any smoking guns

I ran a Fuseki instance with the same watch command you showed in your last 
message.  JVM Heap stays essentially static even after hours, there’s some 
minor fluctuation up and down in used heap space but the heap itself doesn’t 
grow at all.  Did this with a couple of different versions of 4.x to see if 
there’s any discernible difference but nothing meaningful showed up.  I also 
used 3.17.0 but again couldn’t reproduce the behaviour you are describing.

For reference I’m on OS X 13.4.1 using OpenJDK 17

The process peak memory (for all versions I tested) seems to peak at about 1.5G 
as reported by the vmmap tool.  Ongoing monitoring, i.e., OS X Activity Monitor 
shows the memory usage of the process fluctuating over time, but I don’t ever 
see the unlimited growth that your original report suggested.  Also, I didn’t 
set heap explicitly at all so I’m getting the default max heap of 4GB, and my 
actual heap usage was around 100 MB.

I see from vmmap that most of the memory appears to be virtual memory related 
to the many shared native libraries that the JVM links against which on a real 
OS is often swapped out as it’s not under active usage.

In a container, where swap is likely disabled, that’s obviously more 
problematic as everything occupies memory even if much of it might be for 
native libraries that are never needed by anything Fuseki does.  Again, I don’t 
see how that would lead to the apparently unbounded memory usage you’re 
describing.

You could try using jlink to build a minimal image where you only have the 
parts of the JDK that you need in the image.  I found the following old Jena 
thread - https://lists.apache.org/thread/dmmkndmy2ds8pf95zvqbcxpv84bj7cz6 - 
which actually describes an apparently similar memory issue but also has an 
example of a Dockerfile linked at the start of the thread that builds just such 
a minimal JRE for Fuseki.

Note that I also ran the leaks tool against the long running Fuseki processes 
and that didn’t find anything of note, 5.19KB of memory leaks over a 3.5 hr run 
so no smoking gun there.

Regards,

Rob

From: Dave Reynolds 
Date: Friday, 7 July 2023 at 11:11
To: users@jena.apache.org 
Subject: Re: Mystery memory leak in fuseki
Hi Andy,

Thanks for looking.

Good thought on some issue with stacked requests causing thread leak but
don't think that matches our data.

 From the metrics the number of threads and total thread memory used is
not that great and is stable long term while the process size grows, at
least in our situation.

This is based on both the JVM metrics from the prometheus scrape and by
switching on native memory checking and using jcmd to do various low
level dumps.

In a test set up we can replicate the long term (~3 hours) process
growth (while the heap, non-heap and threads stay stable) by just doing
something like:

watch -n 1 'curl -s http://localhost:3030/$/metrics'

With no other requests at all. So I think that makes it less likely the
root cause is triggered by stacked concurrent requests. Certainly the
curl process has exited completely each time. Though I guess there could
some connection cleanup going on in the linux kernel still.

 > Is the OOM kill the container runtime or Java exception?

We're not limiting the container memory but the OOM error is from docker
runtime itself:
 fatal error: out of memory allocating heap arena map

We have replicated the memory growth outside a container but not left
that to soak on a small machine to provoke an OOM, so not sure if the
OOM killer would hit first or get a java OOM exception first.

One curiosity we've found on the recent tests is that, when the process
has grown to dangerous level for the server, we do randomly sometimes
see the JVM (Temurin 17.0.7) spit out a thread dump and heap summary as
if there were a low level exception. However, there's no exception
message at all - just a timestamp the thread dump and nothing else. The
JVM seems to just carry on and the process doesn't exit. We're not
setting any debug flags and not requesting any thread dump, and there's
no obvious triggering event. This is before the server gets completely
out of the memory causing the docker runtime to barf.

Dave


On 07/07/2023 09:56, Andy Seaborne wrote:
> I tried running without any datasets. I get the same heap effect of
> growing slowly then a dropping back.
>
> Fuseki Main (fuseki-server did the same but the figures are from main -
> there is less going on)
> Version 4.8.0
>
> fuseki -v --ping --empty# No datasets
>
> 4G heap.
> 71M allocated
> 4 threads (+ Daemon system threads)
> 2 are not parked (i.e. they are blocked)
> The heap grows slowly to 48M then a GC runs then drops to 27M
> This repeats.
>
> Run one ping.
> Heap now 142M, 94M/21M GC cycle
> and 2 more threads at least for a while. They seem to go away after time.
> 2 are not parked.
>
> Now pause process the JVM, 

Re: OOM Killed

2023-07-10 Thread Rob @ DNR
While we appreciate that this unresolved memory issue is painful for users, I 
would strongly emphasise that the project DOES NOT recommend people use 
outdated versions of Jena.

3.17.0 was released in November 2020, which makes it 2.5 years old at this point.
There have been lots of security, performance and correctness fixes in that
time.

Rob

From: Dave Reynolds 
Date: Monday, 10 July 2023 at 08:36
To: users@jena.apache.org 
Subject: Re: OOM Killed
There is an issue with memory growth in fuseki, though it's growth
outside of normal java heap and non-heap space.

See https://www.mail-archive.com/users@jena.apache.org/msg20362.html

For that scale of data and scale of machine I suggest setting the heap
smaller, -Xmx1G or -Xmx500M. Empirically the process growth seems to
largely level off at around 4x the given heap size (though this very
much depends on the usage model and I have no clear explanation for this).

You might also try -XX:MaxDirectMemorySize=1G or less though exactly
what size to set will depend on how much data is involved in your
queries. If the process dies with an exception about unable to allocate
new direct memory then increase it.
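
For example (illustrative values; adjust the heap, direct memory limit and
dataset options to your setup):

    java -Xmx1G -XX:MaxDirectMemorySize=1G -jar fuseki-server.jar --loc=/path/to/tdb2 /ds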

If this is not a public service liable to security issues and you are
able to use a 3.17.0 (or earlier) version of fuseki then those are not
subject to this growth issue. Or at least not to the version of the
issue that we are seeing in our own usage.

Dave

On 09/07/2023 20:33, Laura Morales wrote:
> I'm running a job that is submitting a lot of queries to a Fuseki server, in 
> parallel. My problem is that Fuseki is OOM-killed and I don't know how to fix 
> this. Some details:
>
> - Fuseki is queried as fast as possible. Queries take around 50-100ms to 
> complete so I think it's serving 10s of queries each second
> - Fuseki 4.8. OS is Debian 12 (minimal installation with only OS, Fuseki, no 
> desktop environments, uses only ~100MB of RAM)
> - all the queries are read queries. No updates, inserts, or other write 
> queries
> - all the queries are over HTTP to the Fuseki endpoint
> - database is TDB2 (created with tdb2.tdbloader)
> - database contains around 2.5M triples
> - the machine has 8GB RAM. I've tried on another PC with 16GB and it 
> completes the job. On 8GB though, it won't
> - with -Xmx6G it's killed earlier. With -Xmx2G it's killed later. Either way 
> it's always killed.
>
> Is there anything that I can tweak to avoid Fuseki getting killed? Something 
> that isn't "just buy more RAM".
> Thank you


Re: Mystery memory leak in fuseki

2023-07-04 Thread Rob @ DNR
Does this only happen in a container?  Or can you reproduce it running locally 
as well?

If you can reproduce it locally then attaching a profiler like VisualVM, so you
can take a heap snapshot and see where the memory is going, would be useful.

Rob

From: Dave Reynolds 
Date: Tuesday, 4 July 2023 at 09:31
To: users@jena.apache.org 
Subject: Re: Mystery memory leak in fuseki
Tried 4.7.0 under most up to date java 17 and it acts like 4.8.0. After
16hours it gets to about 1.6GB and by eye has nearly flatted off
somewhat but not completely.

For interest here's a MEM% curve on a 4GB box (hope the link works).

https://www.dropbox.com/s/xjmluk4o3wlwo0y/fuseki-mem-percent.png?dl=0

The flattish curve from 12:00 to 17:20 is a run using 3.16.0 for
comparison. The curve from then onwards is 4.7.0.

The spikes on the 4.7.0 match the allocation and recovery of the direct
memory buffers. The JVM metrics show those cycling around every 10mins
and being reclaimed each time with no leaking visible at that level.
Heap, non-heap and mapped buffers are all basically unchanging which is
to be expected since it's doing nothing apart from reporting metrics.

Whereas this curve (again from 17:20 onwards) shows basically the same
4.7.0 set up on a separate host, showing that despite flattening out
somewhat usage continues to grow - a least on a 16 hour timescale.

https://www.dropbox.com/s/k0v54yq4kexklk0/fuseki-mem-percent-2.png?dl=0


Both of those runs were using Eclipse Temurin on a base Ubuntu jammy
container. Pervious runs used AWS Corretto on an AL2 base container.
Behaviour basically unchanged so eliminates this being some
Corretto-specific issue or a weird base container OS issue.

Dave

On 03/07/2023 14:54, Andy Seaborne wrote:
> Hi Dave,
>
> Could you try 4.7.0?
>
> 4.6.0 was 2022-08-20
> 4.7.0 was 2022-12-27
> 4.8.0 was 2023-04-20
>
> This is an in-memory database?
>
> Micrometer/Prometheus has had several upgrades but if it is not heap and
> not direct memory (I though that was a hard bound set at start up), I
> don't see how it can be involved.
>
>  Andy
>
> On 03/07/2023 14:20, Dave Reynolds wrote:
>> We have a very strange problem with recent fuseki versions when
>> running (in docker containers) on small machines. Suspect a jetty
>> issue but it's not clear.
>>
>> Wondering if anyone has seen anything like this.
>>
>> This is a production service but with tiny data (~250k triples, ~60MB
>> as NQuads). Runs on 4GB machines with java heap allocation of 500MB[1].
>>
>> We used to run using 3.16 on jdk 8 (AWS Corretto for the long term
>> support) with no problems.
>>
>> Switching to fuseki 4.8.0 on jdk 11 the process grows in the space of
>> a day or so to reach ~3GB of memory at which point the 4GB machine
>> becomes unviable and things get OOM killed.
>>
>> The strange thing is that this growth happens when the system is
>> answering no Sparql queries at all, just regular health ping checks
>> and (prometheus) metrics scrapes from the monitoring systems.
>>
>> Furthermore the space being consumed is not visible to any of the JVM
>> metrics:
>> - Heap and and non-heap are stable at around 100MB total (mostly
>> non-heap metaspace).
>> - Mapped buffers stay at 50MB and remain long term stable.
>> - Direct memory buffers being allocated up to around 500MB then being
>> reclaimed. Since there are no sparql queries at all we assume this is
>> jetty NIO buffers being churned as a result of the metric scrapes.
>> However, this direct buffer behaviour seems stable, it cycles between
>> 0 and 500MB on approx a 10min cycle but is stable over a period of
>> days and shows no leaks.
>>
>> Yet the java process grows from an initial 100MB to at least 3GB. This
>> can occur in the space of a couple of hours or can take up to a day or
>> two with no predictability in how fast.
>>
>> Presumably there is some low level JNI space allocated by Jetty (?)
>> which is invisible to all the JVM metrics and is not being reliably
>> reclaimed.
>>
>> Trying 4.6.0, which we've had less problems with elsewhere, that seems
>> to grow to around 1GB (plus up to 0.5GB for the cycling direct memory
>> buffers) and then stays stable (at least on a three day soak test).
>> We could live with allocating 1.5GB to a system that should only need
>> a few 100MB but concerned that it may not be stable in the really long
>> term and, in any case, would rather be able to update to more recent
>> fuseki versions.
>>
>> Trying 4.8.0 on java 17 it grows rapidly to around 1GB again but then
>> keeps ticking up slowly at random intervals. We project that it would
>> take a few weeks to grow the scale it did under java 11 but it will
>> still eventually kill the machine.
>>
>> Anyone seem anything remotely like this?
>>
>> Dave
>>
>> [1]  500M heap may be overkill but there can be some complex queries
>> and that should still leave plenty of space for OS buffers etc in the
>> remaining memory on a 4GB machine.
>>
>>
>>


Re: Need suggestions on handling latency on asyncparser

2023-06-27 Thread Rob @ DNR
Can you characterise what you mean by latency here?  i.e. are you talking about 
a measurable delay or something else?

Your code sample looks incomplete because you get the iterator BUT you never 
consume the iterator.  If you don’t consume the iterator then you aren’t ever 
going to get any data from the parser.

Also, in at least one case you set the chunk size to 10, which is very small and
means the parser can only read a tiny amount ahead on its background thread.
You are better off with a larger chunk size so that the parser and the consuming
thread can overlap computation better and cache more data in memory, e.g. as
sketched below.
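
A rough sketch (untested; the package name org.apache.jena.riot.system is
assumed, adjust if your version differs). Note the larger chunk size, and that
the stream is both created and consumed:

import java.io.InputStream;
import java.util.stream.Stream;

import org.apache.jena.graph.Triple;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.system.AsyncParser;

public class ParseExample {
    static void parse(InputStream in, Lang lang, String baseUri) {
        // A larger chunk size lets the background parser thread read well ahead
        try (Stream<Triple> triples = AsyncParser.of(in, lang, baseUri)
                                                 .setChunkSize(10_000)
                                                 .streamTriples()) {
            // Consume on this thread; without consumption no data is delivered
            triples.forEach(t -> System.out.println(t));  // replace with real handling
        }
    }
}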

Rob

From: Abika Chitra 
Date: Monday, 26 June 2023 at 19:27
To: users@jena.apache.org 
Subject: Need suggestions on handling latency on asyncparser
Hi there,

This is Abika from MarkLogic. I work on a project called MLCP (MarkLogic
Content Pump), which is used to bulk load and transfer data to and from the
MarkLogic server.

Our MLCP project uses the Jena framework to process RDF. We previously built
this project with Jena 2.13 and now we are transitioning to Jena 4.8.0 (latest
available). Given the time frame, there are many changes in parsing with Jena.
Right now we are using AsyncParser (following a suggestion from the Jena
javadocs). I notice there is some lag when using AsyncParser, and it’s also
mentioned in the Javadoc.

In our codebase, to parse a bunch of files in a zip/archive, we create a
runnable parser for each file and submit them to an executor service. In this
implementation, I see some of the files being skipped due to the latency in
result delivery from the AsyncParser API calls. For now we are considering
either implementing a wait on the results from AsyncParser, or no longer
creating parallel threads to process many files in a zip/archive.

I would like to get your suggestions on robust ways to handle this latency.

Here’s the code of run() in the runnable parser class. While debugging, I see
the latency when the API calls to the RIOT parser and AsyncParser don’t return
data right after the call.


public void run() {
    ErrorHandler handler = new ParserErrorHandler(fsname);
    ParserProfile prof = RiotLib.profile(lang, fsname, handler);
    try {
        if (lang == Lang.TRIG) {
            rdfInputStream = AsyncParser.of(in, lang, origFn).streamQuads();
            rdfIter = rdfInputStream.iterator();
        } else if (lang == Lang.NTRIPLES) {
            rdfIter = RiotParsers.createIteratorNTriples(in, prof);
            System.out.println("2else ntriples async run ");
        } else if (lang == Lang.NQUADS) {
            rdfIter = RiotParsers.createIteratorNQuads(in, prof);
        } else {
            rdfInputStream = AsyncParser.of(in, lang, fsname).setChunkSize(10).streamTriples();
            rdfIter = rdfInputStream.iterator();
        }
    } catch (Exception ex) {
        failed = true;
        LOG.error("Parse error in RDF document (please check intactness and encoding); "
                + "processing partial document: " + origFn + " " + ex.getMessage());
        ex.printStackTrace();
    }
}



pool = Executors.newFixedThreadPool(1);

RunnableParser jenaStreamingParser = new RunnableParser(origFn, fsname, in, lang);

pool.submit(jenaStreamingParser);



Regards,
Abika



Re: Confirming on usage of andrewoma dexx collection from Jena-base

2023-06-27 Thread Rob @ DNR
Are you concerned by the com.github part of the coordinates?

This is standard practice for open source projects that are hosted on GitHub, 
and is documented in the official Maven Central Guidelines on Coordinates at 
https://central.sonatype.org/publish/requirements/coordinates/#introduction

Rob

From: Lorenz Buehmann 
Date: Tuesday, 27 June 2023 at 06:32
To: users@jena.apache.org 
Subject: Re: Confirming on usage of andrewoma dexx collection from Jena-base
Hi,

are you talking about your own fork of Jena in your company? And you're
asking if there is anything preventing you from modifying the POM file
in jena-base module? Is that something the Apache 2 License would care
about? Isn't it more about your Marklogic product in the end?

Or do you want to redistribute that adopted Jena version at any place
and you're not sure?


Cheers,

Lorenz

On 26.06.23 20:26, Abika Chitra wrote:
> Hi There,
>
> I would like to confirm the usage of the 
> com.github.andrewoma.dexx:collection jar in the jena-base jar. Our 
> application needs this to be mentioned in our pom for runtime execution, and 
> we also have to package it with our product for command-line dependency usage. 
> Since the andrewoma repository looked a little different from an official 
> library, we would like to check with the team whether it is okay to package 
> it as a third-party library addition.
>
> Regards,
> Abika
>
>
--
Lorenz Bühmann
Research Associate/Scientific Developer

Email buehm...@infai.org

Institute for Applied Informatics e.V. (InfAI) | Goerdelerring 9 | 04109 
Leipzig | Germany


Re: CVE-2023-22665 Risk using Fuseki Pre 4.8.0

2023-06-01 Thread Rob @ DNR
Yes, prior to 4.8.0 users can craft a query that calls arbitrary JavaScript 
functions even if you have not explicitly configured custom scripts.

As discussed on our Security Advisories page [1], the project's advice is always 
to use the latest version available.

Or, as already noted in this thread, run using Java 17, as that does not have a 
script engine embedded by default.  Java code is generally forward compatible, 
so even though the project releases builds that target Java 11 it's fine to run 
them on a newer JVM.

Is there any particular reason you haven’t yet upgraded to 4.8.0?

Rob

[1]: 
https://jena.apache.org/about_jena/security-advisories.html#standard-mitigation-advice

From: Brandon Sara 
Date: Thursday, 1 June 2023 at 02:05
To: users@jena.apache.org 
Subject: Re: CVE-2023-22665 Risk using Fuseki Pre 4.8.0
I'm running with a version built and run with Java 11. Given this, is there 
still a risk/concern if I don't have custom scripts configured at all on the 
Fuseki server?

On May 31, 2023, at 12:06 PM, Andy Seaborne  wrote:

"EXTERNAL EMAIL" – Always use caution when reviewing mail from outside of the 
organization.



On 31/05/2023 17:17, Brandon Sara wrote:
>
> With CVE-2023-22665, what is the risk of using Fuseki pre-4.8.0 that does not 
> have custom scripts configured in any configurations? Is there only a risk if 
> custom scripts are set up to be used by Fuseki or is there a risk regardless 
> of configuration?
>
> Thanks.

Java 17 does not have a JavaScript engine, unless the deployment adds one.

So running on Java 17 means that scripts can't execute.

The issue is Java 11, where there is a script engine in the JVM runtime.

Andy

https://openjdk.org/jeps/372
Nashorn was removed in Java 15.




Re: Binary literals

2023-05-04 Thread Rob @ DNR
My 2 cents:  Base64 might be preferable to hex encoding since it is inherently 
more compact

Rob

From: Nicholas Car 
Date: Thursday, 4 May 2023 at 10:58
To: users@jena.apache.org 
Subject: Re: Binary literals
Hi Rob,

Thanks for this: it is pretty much as I thought!

I think we will then be able to cater for WKB in GeoSPARQL 1.3 with just hex 
encoding of the value and ^^geo:wkbLiteral; as you say, implementers like 
Jena-geosparql can then read the hex into their spatial indexes one time.

I see little value in this other than meeting an allowed data type in the 
Simple Features standard; then again, I see little value in KML and other 
existing, allowed formats too!

Cheers, Nick




--- Original Message ---
On Thursday, May 4th, 2023 at 18:30, Rob @ DNR  wrote:


> Well, the RDF specifications fundamentally define RDF literals to be the 
> following:
>
> * a lexical form, being a Unicode [UNICODE: 
> https://www.w3.org/TR/rdf11-concepts/#bib-UNICODE] string, which should be in 
> Normal Form C [NFC: https://www.w3.org/TR/rdf11-concepts/#bib-NFC],
>
> https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal
>
> So, you are effectively forced to use some sort of string-based encoding of 
> the binary data to represent any literal, whether that underlying datatype is 
> truly binary data.
>
> Now in principle you could define a custom implementation of the LiteralLabel 
> interface that stores the value as true binary, i.e. byte[], and only 
> materialises it into a string encoding when absolutely necessary. This could 
> then be used to create instances via NodeFactory.create(LiteralLabel).
>
> However, data into and out of the system is generally going to be via a RDF 
> serialisation, which again will require string encoding or decoding as 
> appropriate. And the parsers don’t really care about datatypes so your custom 
> implementation wouldn’t get used. Thus, whether a custom LiteralLabel would 
> actually gain you anything would depend on how the data is coming into the 
> system and how you consume it. If the data is coming in via some programmatic 
> means that isn’t parsing serialised RDF then maybe but I don’t think it would 
> gain you much.
>
> For spatial indexing generally the approach of a GeoSPARQL implementation is 
> to build the spatial index up-front so you’d only pay the cost of the string 
> to binary decoding once when the index was first built from the RDF data. The 
> spatial index is going to convert the incoming geo-data into its own internal 
> index structures that will be very efficient to access, at which point 
> whether the binary data was originally string encoded is irrelevant.
>
> Regards,
>
> Rob Vesse
>
> From: Nicholas Car n...@kurrawong.net
>
> Date: Wednesday, 3 May 2023 at 23:22
> To: users@jena.apache.org
>
> Subject: Re: Binary literals
> I see Base64 is an XSD option too, but I’m most interested in “true” binary, 
> as opposed to binary-as-text options, and whether any exist!
>
> Nick
>
On Thu, May 4, 2023 at 8:13 am, Nicholas Car <n...@kurrawong.net> wrote:
>
> > Dear Jena users,
> >
> > How can I store binary literals in RDF and in Jena/Fuseki?
> >
> > There is xsd:hexBinary for arbitrary binary data but is there a better/more 
> > efficient/another way to store binary literals in Jena?
> >
> > The reason I ask is that a future version of GeoSPARQL might want to 
> > include WKB - Well-Known Binary - as a geometry format option. We would 
> > hope this can be efficiently accessed by a spatial index so we want to know 
> > how to handle perhaps a custom data type, perhaps geo:wkbLiteral, and how 
> > best to store this in Jena, perhaps not as hex text.
> >
> > Thanks, Nick


Re: Binary literals

2023-05-04 Thread Rob @ DNR
Well, the RDF specifications fundamentally define RDF literals to be the 
following:

  *   a lexical form, being a Unicode [UNICODE] string, which should be in 
Normal Form C [NFC],
https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal

So, you are effectively forced to use some sort of string-based encoding of the 
binary data to represent any literal, whether that underlying datatype is truly 
binary data.

Now in principle you could define a custom implementation of the LiteralLabel 
interface that stores the value as true binary, i.e. byte[], and only 
materialises it into a string encoding when absolutely necessary.  This could 
then be used to create instances via NodeFactory.create(LiteralLabel).
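
For illustration, a minimal sketch of the simpler string-encoded route 
(assuming the standard NodeFactory and XSDDatatype APIs; loadWkb() is a 
hypothetical placeholder for wherever the bytes come from):

import java.util.Base64;
import org.apache.jena.datatypes.xsd.XSDDatatype;
import org.apache.jena.graph.Node;
import org.apache.jena.graph.NodeFactory;

byte[] wkb = loadWkb();   // hypothetical source of the binary geometry
// Encode once, then carry the value around as an ordinary typed literal
String lex = Base64.getEncoder().encodeToString(wkb);
Node literal = NodeFactory.createLiteral(lex, XSDDatatype.XSDbase64Binary);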

However, data into and out of the system is generally going to be via a RDF 
serialisation, which again will require string encoding or decoding as 
appropriate.  And the parsers don’t really care about datatypes so your custom 
implementation wouldn’t get used.  Thus, whether a custom LiteralLabel would 
actually gain you anything would depend on how the data is coming into the 
system and how you consume it.  If the data is coming in via some programmatic 
means that isn’t parsing serialised RDF then maybe but I don’t think it would 
gain you much.

For spatial indexing generally the approach of a GeoSPARQL implementation is to 
build the spatial index up-front so you’d only pay the cost of the string to 
binary decoding once when the index was first built from the RDF data.  The 
spatial index is going to convert the incoming geo-data into its own internal 
index structures that will be very efficient to access, at which point whether 
the binary data was originally string encoded is irrelevant.

Regards,

Rob Vesse

From: Nicholas Car 
Date: Wednesday, 3 May 2023 at 23:22
To: users@jena.apache.org 
Subject: Re: Binary literals
I see Base64 is an XSD option too, but I’m most interested in “true” binary, as 
opposed to binary-as-text options, and whether any exist!

Nick

On Thu, May 4, 2023 at 8:13 am, Nicholas Car <n...@kurrawong.net> wrote:

> Dear Jena users,
>
> How can I store binary literals in RDF and in Jena/Fuseki?
>
> There is xsd:hexBinary for arbitrary binary data but is there a better/more 
> efficient/another way to store binary literals in Jena?
>
> The reason I ask is that a future version of GeoSPARQL might want to include 
> WKB - Well-Known Binary - as a geometry format option. We would hope this can 
> be efficiently accessed by a spatial index so we want to know how to handle 
> perhaps a custom data type, perhaps geo:wkbLiteral, and how best to store 
> this in Jena, perhaps not as hex text.
>
> Thanks, Nick


Re: Strategies to avoid log flooding

2023-03-29 Thread Rob @ DNR
jar:4.6.1]
Mar 27 13:13:33 insight-terms java[2512289]: #011at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:894)
~[fuseki-server.jar:4.6.1]
Mar 27 13:13:33 insight-terms java[2512289]: #011at
org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1038)
~[fuseki-server.jar:4.6.1]
Mar 27 13:13:33 insight-terms java[2512289]: #011at
java.lang.Thread.run(Thread.java:829) ~[?:?]


On 28/03/2023 16.04, Rob @ DNR wrote:
> A GitHub issue with a minimal example query that reproduces the problem would 
> be a good start, so we can reproduce it and look into a fix
>
> As a workaround, end users control their logging configuration, so you could 
> create a Log4j configuration that disables logging for the specific offending 
> logger (assuming that logger is specific enough not to suppress genuinely 
> relevant logging)
>
> Rob
>
> From: Mikael Pesonen 
> Date: Tuesday, 28 March 2023 at 11:21
> To: users@jena.apache.org 
> Subject: Strategies to avoid log flooding
> Hi,
>
> there are some cases where Jena generates dozens of gigabytes, maybe even 
> terabytes, of log output from one query. If you add a bad REGEX, it generates 
> a long warning-level exception for every row in the db, or at least a million 
> of them (the disk filled up, so we don't know). Is there another way to avoid 
> this other than disabling warnings?
>

--
Lingsoft - 30 years of Leading Language Management

www.lingsoft.fi

Speech Applications - Language Management - Translation - Reader's and Writer's 
Tools - Text Tools - E-books and M-books

Mikael Pesonen
Semantic Technologies

e-mail: mikael.peso...@lingsoft.fi
Tel. +358 2 279 3300

Time zone: GMT+2

Helsinki Office
Eteläranta 10
FI-00130 Helsinki
FINLAND

Turku Office
Kauppiaskatu 5 A
FI-20100 Turku
FINLAND


Re: Strategies to avoid log flooding

2023-03-28 Thread Rob @ DNR
A GitHub issue with a minimal example query that reproduces the problem would be 
a good start, so we can reproduce it and look into a fix

As a workaround, end users control their logging configuration, so you could 
create a Log4j configuration that disables logging for the specific offending 
logger (assuming that logger is specific enough not to suppress genuinely 
relevant logging)
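
For example, a hypothetical log4j2.properties fragment (the logger name here is 
only a guess - substitute whichever logger shows up in the flooded log):

# Silence one noisy logger without touching everything else
logger.flood.name = org.apache.jena.riot
logger.flood.level = OFF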

Rob

From: Mikael Pesonen 
Date: Tuesday, 28 March 2023 at 11:21
To: users@jena.apache.org 
Subject: Strategies to avoid log flooding
Hi,

there are some cases where Jena generates dozens of gigabytes, maybe even
terabytes, of log output from one query. If you add a bad REGEX, it generates a
long warning-level exception for every row in the db, or at least a million of
them (the disk filled up, so we don't know). Is there another way to avoid this
other than disabling warnings?


Re: from a named graph in a federated query

2023-02-03 Thread Rob @ DNR
The SERVICE clause can include any clauses you can use elsewhere in the query.  
You can use a GRAPH clause inside your SERVICE clause i.e.

SERVICE 
{
  GRAPH ?g { ?protein biolink:provided_by ?soft . }
}

If the results might be in the default graph or a named graph you can use UNION 
and GRAPH within your SERVICE clause i.e.

SERVICE 
{
  { GRAPH ?g { ?protein biolink:provided_by ?soft . } }
  UNION
  { ?protein biolink:provided_by ?soft }
}
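
Applied to the query from your original message, the WHERE clause would look 
something like this (a sketch only: the endpoint URI below is a placeholder, 
and it assumes your biolink: prefix is declared as before):

SELECT DISTINCT ?reaction (COUNT(DISTINCT ?soft) AS ?NSOFT)
WHERE {
  # <http://example.org/sparql> is a placeholder endpoint URI
  SERVICE <http://example.org/sparql>
  {
    GRAPH ?g { ?protein biolink:provided_by ?soft . }
  }
  ?protein biolink:category biolink:Protein .
  ?reaction biolink:has_catalyst ?protein .
}
GROUP BY ?reaction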

Hope this helps,

Rob

From: Steven Blanchard 
Date: Friday, 3 February 2023 at 08:32
To: users 
Subject: from a named graph in a federated query
Dear Jena users,

I would like to do a federated query (with SERVICE) but against a named
graph rather than the default graph. How is this possible?

This query works if data are in the default graph :
query = """
PREFIX biolink: <>
PREFIX up: <>
PREFIX reaction: <>

SELECT DISTINCT
?reaction
(COUNT(DISTINCT ?soft) AS ?NSOFT)
FROM 
WHERE {
SERVICE
<>
{
?protein biolink:provided_by ?soft .
}
?protein biolink:category biolink:Protein .
?reaction biolink:has_catalyst ?protein .

}
GROUP BY ?reaction
"""

But i have no results if data are in a named graph:
query = """
PREFIX biolink: <>
PREFIX up: <>
PREFIX reaction: <>

SELECT DISTINCT
?reaction
(COUNT(DISTINCT ?soft) AS ?NSOFT)
FROM 
WHERE {
SERVICE
<>
{
?protein biolink:provided_by ?soft .
}
?protein biolink:category biolink:Protein .
?reaction biolink:has_catalyst ?protein .

}
GROUP BY ?reaction
"""

What is the syntax to query a named graph via SERVICE?

Thanks in advance,

Steven


Re: Why does the OSPG.dat file grows so much more than all other files?

2023-02-01 Thread Rob @ DNR
Speculating heavily here:

Each index is sorted relative to its keys, which, as already noted earlier in 
the thread, are a sequence of 8-byte Node IDs.  If the update patterns for this 
dataset primarily involve changing the objects of the quads then that could 
lead to much more frequent rebalancing and rewriting of the O-based index.  
This could be particularly true if the objects are of a type that TDB inlines, 
e.g. integers, since the Node ID encoding preserves ordering for some inlined 
types (this allows range-based scans to optimise some query filters).  So 
frequently updating an object that has an ordered inlineable value would cause 
the index entry for that quad to be frequently shuffled elsewhere in the B+Tree.

So hypothetically if you have triples of the form   “1”^^xsd:integer 
where the object is some counter/metric you are frequently updating, that would 
cause lots of churn in the OSPG index.  The other indexes would be less 
affected because the other nodes in the quad are changing less frequently.
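
As a purely hypothetical illustration of that pattern (the property name is 
made up), an update like this rewrites only the object of each matched quad:

PREFIX ex: <http://example.org/>

DELETE { ?s ex:hitCount ?old }
INSERT { ?s ex:hitCount ?new }
WHERE  { ?s ex:hitCount ?old . BIND (?old + 1 AS ?new) }

Because the object is the leading key component of OSPG, each such update 
relocates the entry to a different part of that B+Tree, whereas in the S- and 
P-led indexes the entry stays under the same key prefix.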

Rob

From: Andy Seaborne 
Date: Wednesday, 1 February 2023 at 10:11
To: users@jena.apache.org 
Subject: Re: Why does the OSPG.dat file grows so much more than all other files?


On 01/02/2023 07:20, Lorenz Buehmann wrote:
> Interesting insights from both of you, thanks.
>
> @Andy do you have rough idea why only the OSPG index was that large
> compared to the others? What kind of updates would lead to that result?

No. There's no reason from TDB code to treat one index differently.

I suspect that the container host did something, or a second container ran 
against the same database files at some time.

The index is possibly corrupt - the compaction uses SPOG and does not touch 
OSPG, so the new DB becomes valid.

We don't know much about the usage - clearly there is a high update rate, but 
over what time period?

 Andy

>
>
> On 30.01.23 21:40, Andy Seaborne wrote:
>> Elton - thanks for the update.
>>
>> The index sizes look much more like what I was getting using 100e6
>> BSBM data as a test.
>>
>> Inline ...
>>
>> On 30/01/2023 01:49, Elton Soares wrote:
>>> Hi Lorenz and Andy,
>>>
>>> Thank you for your quick responses and suggestions.
>>>
>>> Q: "Do you have lots of may large literals in your data?"
>>> A: I cannot be sure yet, but as Andy mentioned, the documentation
>>> indicates that the indexes store 8-byte entries instead of the
>>> literals' string representations
>>> (https://jena.apache.org/documentation/tdb/architecture.html). Thus,
>>> although we initially thought that the reason OSPG.dat was so much
>>> larger could be the number of objects being a lot larger than the
>>> number of predicates, subjects and graphs, or the fact that the
>>> literals stored in those objects could be too large, after discussing
>>> internally what is expressed in the documentation we considered it
>>> very unlikely that any of these hypotheses was true, although we
>>> could easily be convinced otherwise by someone who knows the source
>>> code better than us.
>>>
>>> Q: "Also, did you try a compaction on the database? If not, can you
>>> try it and post the new file sizes afterwards? Note, they will be
>>> located in a new ./Data- directory, e.g. before Data-0001 and
>>> afterwards Data-0002"
>>>
>>> After your suggestion, I've tried to run two compaction strategies
>>> on this dataset to see which one would work best.
>>> The one I'm referring to as "official" is the one that uses the
>>> "/$/compact" endpoint and the one I'm referring to as "unofficial" is
>>> the one where I create an NQuads backup and upload it to a new
>>> dataset using the TDBLoader.
>>> The reason I attempted this second strategy is because a
>>> StackOverflow post suggested that it could be significantly more
>>> efficient than the "official" strategy
>>> (https://stackoverflow.com/questions/60501386/compacting-a-dataset-in-apache-jena-fuseki/60631699#60631699).
>>
>> Could be - it's offline to backup-restore so a bulk loader can be used
>> for the restore (and you get a backup file as a record).
>>
>>> We will consider upgrading our Jena Fuseki server to version 4.7.0,
>>> although it is not yet clear that the growth we saw in the OSPG.dat
>>> could be avoided by the changes made from 4.4.0 to 4.7.0. I'll try to
>>> take some time to look into the changelog more carefully to see if
>>> there is anything that seems to relate to that.
>>
>> From your original sizes, would I be right in guessing you hadn't
>> compacted at all and also that you do a significant amount of updates?
>>
>> 4.7.0 wouldn't change the growth situation - it does make compaction
>> in a live server more reliable.
>>
>>> Here is a summary of the results I've obtained with both compaction
>>> strategies (in markdown notation):
>>>
>>> ## Original Dataset
>>>
>>> RDF Stats:
>>>   - Triples: 65222513 (Approximately 65 million)
>>>   - Subjects: 20434264 (Approximately 20 million)
>>>   - Objects: 8565221 (Approximately 8 million)
>>>   -