Separate from the memory limits, which others have discussed, storage
performance makes a big difference.
We successfully run on AWS EKS (Kubernetes) with dataset sizes around 120
million triples (and larger datasets elsewhere), but to make that work
well we use NVMe ephemeral storage rather than EBS/EFS. We use instances
with large ephemeral storage, such as i4i.large, for that. There are
various ways to use ephemeral storage from k8s but we found the simple
brute force approach works best: map the ephemeral disk to the container
storage area so emptyDir volumes land on ephemeral, and use those for the
Fuseki database area. Make sure you have good monitoring, backup and
container init procedures.
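For illustration, a minimal sketch of the pod wiring, assuming the node
bootstrap has already relocated the kubelet/container storage directories
onto the NVMe mount so that emptyDir ends up on ephemeral (the image name
and mount path below are placeholders rather than our exact config):

apiVersion: v1
kind: Pod
metadata:
  name: fuseki
spec:
  nodeSelector:
    # only schedule onto the NVMe-backed instance type
    node.kubernetes.io/instance-type: i4i.large
  containers:
    - name: fuseki
      image: fuseki:5.5.0                # placeholder image
      volumeMounts:
        - name: databases
          mountPath: /fuseki/databases   # TDB database area (path is an assumption)
  volumes:
    - name: databases
      emptyDir: {}   # lives in the node's container storage area, i.e. on the NVMe disk

The point of the brute force approach is that nothing Fuseki-specific is
needed in the manifest; the node setup determines where emptyDir actually
lives.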
It's easy enough to test on a simple EC2 instance whether NVMe gives you
much performance benefit for your query patterns and then, only if so,
figure out how you want to manage ephemeral storage in k8s.
Oh, and on memory use: assuming you have a Prometheus/Grafana or similar
monitoring stack set up, the JVM metrics are very handy guides. The
container WSS (working set size) metric is the one the k8s memory limits
act on, and it should run rather higher than the JVM total; that
difference is largely the buffered pages Rob mentions. We typically expect
that (and thus the pod memory request) to be 2-3 times the committed heap.
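As a rough sketch of one way to watch that ratio (metric and label names
here follow the usual cAdvisor and Micrometer conventions and are
assumptions; adjust to whatever your exporters actually emit), a
Prometheus recording rule:

groups:
  - name: fuseki-memory
    rules:
      - record: fuseki:wss_to_committed_heap:ratio
        # container working set (what the k8s limit acts on) divided by JVM
        # committed heap; we'd expect this to sit around 2-3 in steady state
        expr: |
          sum by (namespace, pod) (container_memory_working_set_bytes{container="fuseki"})
          /
          sum by (namespace, pod) (jvm_memory_committed_bytes{area="heap"})

This assumes both metric sources carry matching namespace/pod labels; if
not, some relabelling is needed before the division lines up.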
We use TDB rather than TDB2 so our experience may not be fully
representative.
Dave
On 15/01/2026 12:24, Vince Wouters via users wrote:
Hello Jena community,
We’re looking for guidance on what other avenues are worth exploring to
improve overall query performance on our Apache Jena Fuseki instance, which
consists of multiple datasets, each containing millions of triples.
*Setup*
- Apache Jena Fuseki *5.5.0*
- TDB2-backed datasets
- Running on *AWS EKS (Kubernetes)*
- Dataset size: ~15.6 million triples
*Infrastructure*
- Instances tested:
- *c5a.2xlarge* (16 GiB instance, 12 GiB pod memory)
- *c5a.4xlarge* (32 GiB instance, 28 GiB pod memory)
- JVM memory derived from container limits
- Grafana metrics show no storage bottleneck (IOPS and throughput remain
well within limits)
*Test Queries*
SELECT (COUNT(DISTINCT ?s) AS ?sCount) WHERE { GRAPH ?g { ?s ?p ?o } }
Takes around 80 seconds for our dataset.
SELECT (COUNT(DISTINCT CONCAT(STR(?s), STR(?p))) AS ?c) WHERE { GRAPH ?g { ?s ?p ?o } }
Takes around 120 seconds for our dataset.
*What we’ve observed*
- The first query is stable once a minimum heap is available.
- The second query is memory-intensive:
- On the smaller instance, it will time out once available heap drops
below a certain threshold.
- On the larger instance we see clear improvements, but not linear
scaling.
- Increasing heap helps to a point, but does not feel like the full
solution.
*Other things we’ve tried*
- TDB optimizer, but that isn’t an option with our number of datasets
and graphs, as far as we can tell.
*Question*
Given this type of workload and dataset size, what other routes should we
consider to improve performance, beyond simply adjusting heap size?
In production, we have multiple datasets consisting of millions of triples,
and our end goal is to improve query times for our users.
Any guidance or pointers would be much appreciated.
Thanks in advance.