Hi Vince
> JVM memory derived from container limits
What do you mean by this specifically?
As has been discussed and referenced previously on this list, and as noted
in our TDB FAQs [1], much of the memory usage for TDB databases is off-heap,
via memory-mapped files.
Therefore, setting the JVM heap too high can actually reduce performance,
because the JVM is then competing with the OS for memory, forcing the mapped
files to be paged out.
So firstly, I’d make sure you aren’t setting the JVM heap to use too much
of your available memory. Ensure you are leaving some headroom between the
JVM heap and the container limit so the OS has room for the memory-mapped
files. Since you mention you have Grafana in place, I’d also look at any
metrics available around memory-mapped file usage, paging, etc. to see
whether this might be your problem.
> The second query is memory-intensive
Yes, operators like DISTINCT that require the query engine to keep large
chunks of the data in memory are always going to be memory-intensive. The
Jena query engine is generally designed to stream and calculate results
lazily as much as possible. If you have control over the queries being
issued, then I would look at whether you actually need to be using operators
like DISTINCT in your queries.
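For example, one possible rewrite of your second test query (an untested
sketch, not something I’ve benchmarked against your data) is to count
distinct (?s, ?p) pairs directly instead of building a concatenated string
per row:

SELECT (COUNT(*) AS ?c) WHERE {
  { SELECT DISTINCT ?s ?p WHERE { GRAPH ?g { ?s ?p ?o } } }
}

The DISTINCT still has to track the pairs it has seen, so it isn’t free,
but it avoids materialising a STR/CONCAT string for every binding and any
accidental collisions between concatenated values.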
> TDB optimizer, but that isn’t an option with our number of datasets
> and graphs, as far as we can tell
I don’t really follow this statement. I assume you’re referring to the
optional stats-based optimiser? Unless your datasets are being frequently
updated, I don’t see why you wouldn’t gain some value from generating the
stats for each dataset. Remember that the TDB optimizer works on a
per-dataset basis, so you can generate a stats file for each dataset (or
some subset of your datasets), place each stats file into the relevant
database directory, and they won’t interact with each other.
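Roughly, per dataset, that looks something like the following (the paths are
hypothetical; for TDB2 the stats file normally goes into the current
Data-NNNN directory inside the database location, and you should generate it
while the dataset isn’t being updated; see the TDB optimizer documentation
for details):

tdb2.tdbstats --loc=/fuseki/databases/dataset-a > /tmp/stats.opt
cp /tmp/stats.opt /fuseki/databases/dataset-a/Data-0001/stats.opt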
> In production, we have multiple datasets consisting of millions of triples,
> and our end goal is to improve query times for our users.
Often the best way to improve query times for users is either to exert more
control over the queries themselves (if they aren’t end-user controlled),
using tools like Jena’s qparse [2] to analyse your queries and experiment
with modifications that might optimise better, or, if you do permit
arbitrary queries, to educate/train your users on best practices: how to
write better queries, SPARQL optimisation, and so on.
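For example (flags from memory, so double-check against qparse --help and
the linked documentation), printing the algebra Jena will evaluate for a
saved query shows you where joins, filters and DISTINCT end up in the plan:

qparse --print=op --query your-query.rq

Comparing the algebra before and after a rewrite is often more informative
than comparing wall-clock times alone.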
Another thing to consider: if you aren’t doing federated queries across
your multiple datasets, you might actually be better off running
independent, smaller instances of Fuseki on smaller AWS nodes, each serving
a separate dataset. This would give you more flexibility to tune the
resources, JVM heap, etc. for each dataset depending on its needs.
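As a very rough sketch (the heap variable name depends on your launch
script/image, so take it as an assumption), that could look like one small
Fuseki process per dataset, each sized for its own workload:

JVM_ARGS="-Xmx2g" fuseki-server --loc=/databases/dataset-a /dataset-a
JVM_ARGS="-Xmx8g" fuseki-server --loc=/databases/dataset-b /dataset-b

In Kubernetes terms that would be separate Deployments with their own memory
requests/limits rather than one large pod hosting everything.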
Hope this helps,
Rob
[1] https://jena.apache.org/documentation/tdb/faqs.html#java-heap
[2] https://jena.apache.org/documentation/query/explain.html
From: Vince Wouters via users <[email protected]>
Date: Thursday, 15 January 2026 at 12:25
To: [email protected] <[email protected]>
Cc: Vince Wouters <[email protected]>
Subject: Increasing Apache Jena Performance
Hello Jena community,
We’re looking for guidance on what other avenues are worth exploring to
improve overall query performance on our Apache Jena Fuseki instance, which
consists of multiple datasets, each containing millions of triples.
*Setup*
- Apache Jena Fuseki *5.5.0*
- TDB2-backed datasets
- Running on *AWS EKS (Kubernetes)*
- Dataset size: ~15.6 million triples
*Infrastructure*
- Instances tested:
- *c5a.2xlarge* (16 GiB instance, 12 GiB pod memory)
- *c5a.4xlarge* (32 GiB instance, 28 GiB pod memory)
- JVM memory derived from container limits
- Grafana metrics show no storage bottleneck (IOPS and throughput remain
well within limits)
*Test Queries*
SELECT (COUNT(DISTINCT ?s) AS ?sCount) WHERE { GRAPH ?g { ?s ?p ?o } }
Takes around 80 seconds for our dataset.
SELECT (COUNT(DISTINCT CONCAT(STR(?s), STR(?p))) AS ?c) WHERE { GRAPH ?g {
?s ?p ?o } }
Takes around 120 seconds for our dataset.
*What we’ve observed*
- The first query is stable once a minimum heap is available.
- The second query is memory-intensive:
- On the smaller instance, it will time out once available heap drops
below a certain threshold.
- On the larger instance we see clear improvements, but not linear
scaling.
- Increasing heap helps to a point, but does not feel like the full
solution.
*Other things we’ve tried*
- TDB optimizer, but that isn’t an option with our number of datasets
and graphs, as far as we can tell.
*Question*
Given this type of workload and dataset size, what other routes should we
consider to improve performance, beyond simply adjusting heap size?
In production, we have multiple datasets consisting of millions of triples,
and our end goal is to improve query times for our users.
Any guidance or pointers would be much appreciated.
Thanks in advance.