Separate from the memory limits, which others have discussed, storage
performance makes a big difference.
We successfully run on AWS EKS (Kubernetes) with dataset sizes around 120
million triples (and larger datasets elsewhere), but to make that work
well we use NVMe ephemeral storage rather than EBS/EFS. We use instances
with large ephemeral storage, such as i4i.large, for that. There are
various ways to use ephemeral storage from k8s but we found the simple
brute force approach works best: map the ephemeral disk to the container
storage area so emptyDir volumes land on ephemeral, and use those for the
Fuseki database area. Make sure you have good monitoring, backup and
container init procedures.
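For illustration, a minimal sketch of the pod wiring, assuming the node
bootstrap has already relocated the kubelet/container storage directories
onto the NVMe mount so that emptyDir ends up on ephemeral (the image name
and mount path below are placeholders rather than our exact config):

apiVersion: v1
kind: Pod
metadata:
  name: fuseki
spec:
  nodeSelector:
    # only schedule onto the NVMe-backed instance type
    node.kubernetes.io/instance-type: i4i.large
  containers:
    - name: fuseki
      image: fuseki:5.5.0                # placeholder image
      volumeMounts:
        - name: databases
          mountPath: /fuseki/databases   # TDB database area (path is an assumption)
  volumes:
    - name: databases
      emptyDir: {}   # lives in the node's container storage area, i.e. on the NVMe disk

The point of the brute force approach is that nothing Fuseki-specific is
needed in the manifest; the node setup determines where emptyDir actually
lives.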
It's easy enough to test on a simple EC2 instance whether NVMe gives you
much performance benefit for your query patterns and then, only if so,
figure out how you want to manage ephemeral storage in k8s.
Oh, and on memory use: assuming you have a Prometheus/Grafana or similar
monitoring stack set up, the JVM metrics are very handy guides. The
container WSS (working set size) metric is the one the k8s memory limits
act on, and it should run rather higher than the JVM total; that
difference is largely the buffered pages Rob mentions. We typically expect
that (and thus the pod memory request) to be 2-3 times the committed heap.
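As a rough sketch of one way to watch that ratio (metric and label names
here follow the usual cAdvisor and Micrometer conventions and are
assumptions; adjust to whatever your exporters actually emit), a
Prometheus recording rule:

groups:
  - name: fuseki-memory
    rules:
      - record: fuseki:wss_to_committed_heap:ratio
        # container working set (what the k8s limit acts on) divided by JVM
        # committed heap; we'd expect this to sit around 2-3 in steady state
        expr: |
          sum by (namespace, pod) (container_memory_working_set_bytes{container="fuseki"})
          /
          sum by (namespace, pod) (jvm_memory_committed_bytes{area="heap"})

This assumes both metric sources carry matching namespace/pod labels; if
not, some relabelling is needed before the division lines up.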
We use TDB rather than TDB2 so our experience may not be fully
representative.
Dave
On 15/01/2026 12:24, Vince Wouters via users wrote:
Hello Jena community,
We’re looking for guidance on what other avenues are worth exploring to
improve overall query performance on our Apache Jena Fuseki instance, which
consists of multiple datasets, each containing millions of triples.
*Setup*
- Apache Jena Fuseki *5.5.0*
- TDB2-backed datasets
- Running on *AWS EKS (Kubernetes)*
- Dataset size: ~15.6 million triples
*Infrastructure*
- Instances tested:
- *c5a.2xlarge* (16 GiB instance, 12 GiB pod memory)
- *c5a.4xlarge* (32 GiB instance, 28 GiB pod memory)
- JVM memory derived from container limits
- Grafana metrics show no storage bottleneck (IOPS and throughput remain
well within limits)
*Test Queries*
SELECT (COUNT(DISTINCT ?s) AS ?sCount) WHERE { GRAPH ?g { ?s ?p ?o } }
Takes around 80 seconds for our dataset.
SELECT (COUNT(DISTINCT CONCAT(STR(?s), STR(?p))) AS ?c) WHERE { GRAPH ?g { ?s ?p ?o } }
Takes around 120 seconds for our dataset.
*What we’ve observed*
- The first query is stable once a minimum heap is available.
- The second query is memory-intensive:
- On the smaller instance, it will time out once available heap drops
below a certain threshold.
- On the larger instance we see clear improvements, but not linear
scaling.
- Increasing heap helps to a point, but does not feel like the full
solution.
*Other things we’ve tried*
- TDB optimizer, but that isn’t an option with our number of datasets
and graphs, as far as we can tell.
*Question*
Given this type of workload and dataset size, what other routes should we
consider to improve performance, beyond simply adjusting heap size?
In production, we have multiple datasets consisting of millions of triples,
and our end goal is to improve query times for our users.
Any guidance or pointers would be much appreciated.
Thanks in advance.