Hi Vince

> JVM memory derived from container limits

What do you mean by this specifically?

As has been discussed and referenced previously on this list, and as noted in 
our TDB FAQs [1], much of the memory usage for TDB databases is off-heap, via 
memory-mapped files.

Therefore, setting the JVM heap too high can actually reduce performance: the 
JVM ends up competing with the OS for memory, forcing the mapped files to be 
paged out.

So firstly, I’d make sure you aren’t setting the JVM heap to use too much of 
your available memory.  Leave some headroom between the JVM heap and the 
container limit so the OS can use it for the memory-mapped files.  Since you 
mention you have Grafana in place, I’d also look at any metrics that might be 
available around memory-mapped file usage, paging, etc. to see whether this 
might be your problem.
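As a rough illustration (the paths and numbers here are hypothetical, adjust 
them for your pods), for a 16 GiB container you might cap the heap well below 
the limit and leave the remainder to the OS page cache:

```shell
# Hypothetical sizing for a pod with a 16 GiB memory limit.
# Keep the JVM heap modest; the remaining ~12 GiB stays available
# to the OS page cache for TDB2's memory-mapped files.
# The Fuseki startup script picks up JVM settings from JVM_ARGS.
export JVM_ARGS="-Xms4g -Xmx4g"
fuseki-server --loc=/fuseki/databases/ds1 /ds1
```

The exact split is workload-dependent, but the point is that for TDB2 more 
heap is not automatically better.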

> The second query is memory-intensive

Yes, operators like DISTINCT that require the query engine to keep large chunks 
of the data in memory are always going to be memory-intensive.  The Jena query 
engine is generally designed to stream and compute results lazily as much as 
possible.  If you have control over the queries being issued, I would look at 
whether you actually need to be using operators like DISTINCT in your queries.
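For example, your second test query builds a string for every row before 
de-duplicating.  A subquery over the raw bindings (a sketch, not tested 
against your data) avoids materialising those strings and may reduce memory 
pressure:

```sparql
# Count distinct (?s, ?p) pairs directly, without CONCAT/STR.
# This also avoids false merges when concatenated strings collide.
SELECT (COUNT(*) AS ?c) WHERE {
  SELECT DISTINCT ?s ?p WHERE { GRAPH ?g { ?s ?p ?o } }
}
```

DISTINCT still has to hold the pairs, but it works on node bindings rather 
than freshly built strings.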

> TDB optimizer, but that isn’t an option with our number of datasets
> and graphs, as far as we can tell

I don’t really follow this statement.  I assume you’re referring to the 
optional stats-based optimizer?  Unless your datasets are being frequently 
updated, I don’t see why you wouldn’t gain some value from generating the 
stats for each dataset.  Remember that the TDB optimizer works on a 
per-dataset basis: you can generate a stats file for each dataset, or some 
subset of your datasets, placing each stats file into the relevant database 
directory; the stats files don’t interact with each other.
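For TDB2 the rough recipe (directory names here are illustrative) is to 
generate the stats while the dataset isn’t being updated, then drop the file 
into the database’s current Data-* directory:

```shell
# Generate stats for one TDB2 database and install them.
# /fuseki/databases/ds1 is a hypothetical database location.
# Write to a temp file first so tdbstats doesn't scan its own output,
# then move it into the active Data-* directory (Data-0001 for a
# freshly created database; check which is current after compaction).
tdb2.tdbstats --loc=/fuseki/databases/ds1 > /tmp/stats.opt
mv /tmp/stats.opt /fuseki/databases/ds1/Data-0001/stats.opt
```

Repeat per dataset; each database only ever reads its own stats.opt.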

> In production, we have multiple datasets consisting of millions of triples,
> and our end goal is to improve query times for our users.

Often the best way to improve query times for users is either to exert more 
control over the queries (if they aren’t end-user controlled), using tools 
like Jena’s qparse [2] to analyse your queries and experiment with 
modifications that might optimise better, or, if you permit arbitrary 
queries, to educate/train your users on best practices: how to write better 
queries, SPARQL optimisation, and so on.
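For instance, qparse can print the algebra Jena will evaluate, which makes it 
easier to see where a DISTINCT or ORDER BY forces results to be materialised 
(the query file name is hypothetical):

```shell
# Print the SPARQL algebra for a saved query
qparse --print=op --query slow-query.rq
```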

Another thing to consider: if you aren’t doing federated queries across your 
multiple datasets, you might actually be better off running independent, 
smaller instances of Fuseki on smaller AWS nodes, each serving a separate 
dataset.  That would give you more flexibility to tune the resources, JVM 
heap, etc. for each dataset depending on its needs.

Hope this helps,

Rob


[1] https://jena.apache.org/documentation/tdb/faqs.html#java-heap
[2] https://jena.apache.org/documentation/query/explain.html

From: Vince Wouters via users <[email protected]>
Date: Thursday, 15 January 2026 at 12:25
To: [email protected] <[email protected]>
Cc: Vince Wouters <[email protected]>
Subject: Increasing Apache Jena Performance

Hello Jena community,

We’re looking for guidance on what other avenues are worth exploring to
improve overall query performance on our Apache Jena Fuseki instance, which
consists of multiple datasets, each containing millions of triples.


*Setup*

   - Apache Jena Fuseki *5.5.0*
   - TDB2-backed datasets
   - Running on *AWS EKS (Kubernetes)*
   - Dataset size: ~15.6 million triples


*Infrastructure*

   - Instances tested:
      - *c5a.2xlarge* (16 GiB instance, 12 GiB pod memory)
      - *c5a.4xlarge* (32 GiB instance, 28 GiB pod memory)
   - JVM memory derived from container limits
   - Grafana metrics show no storage bottleneck (IOPS and throughput remain
   well within limits)

*Test Queries*
SELECT (COUNT(DISTINCT ?s) AS ?sCount) WHERE { GRAPH ?g { ?s ?p ?o } }

Takes around 80 seconds for our dataset.

SELECT (COUNT(DISTINCT CONCAT(STR(?s), STR(?p))) AS ?c) WHERE { GRAPH ?g { ?s ?p ?o } }

Takes around 120 seconds for our dataset.

*What we’ve observed*

   - The first query is stable once a minimum heap is available.
   - The second query is memory-intensive:
      - On the smaller instance, it will time out once available heap drops
      below a certain threshold.
      - On the larger instance we see clear improvements, but not linear
      scaling.
   - Increasing heap helps to a point, but does not feel like the full
   solution.


*Other things we’ve tried*

   - TDB optimizer, but that isn’t an option with our number of datasets
   and graphs, as far as we can tell.

*Question*
Given this type of workload and dataset size, what other routes should we
consider to improve performance, beyond simply adjusting heap size?

In production, we have multiple datasets consisting of millions of triples,
and our end goal is to improve query times for our users.

Any guidance or pointers would be much appreciated.

Thanks in advance.
