Hey Rob,

I've changed the subject so as not to derail the memory optimization thread.

Given how many times this topic has come up on this list, and how often
Jena struggles with the kinds of queries and sizes of datasets that it
should, in theory, be able to handle, maybe the problem is the Java-based
architecture of TDB?

Maybe it calls for a new kind of persistence backend, such as RocksDB? We
know a prototype of this exists: https://github.com/afs/TDB3
We also know that Stardog is using RocksDB for storage:
https://docs.stardog.com/operating-stardog/database-administration/storage-optimize

IMO the lack of scalable open-source triplestores is one of the main pain
points in the RDF ecosystem.
I love Jena and Fuseki and I'm using them as the default triplestore in my
projects, but I have doubts about whether I could use them in a high-load
production system.

I also know this is an open-source project with limited resources, but that
is a different topic.

Martynas
atomgraph.com

On Thu, Jan 15, 2026 at 2:34 PM Rob @ DNR <[email protected]> wrote:

> Hi Vince
>
> > JVM memory derived from container limits
>
> What do you mean by this specifically?
>
> As has been discussed previously on this list, and as noted in our TDB
> FAQs [1], much of the memory usage for TDB databases is off-heap, via
> memory-mapped files.
>
> Therefore, setting the JVM heap too high can actually reduce performance:
> the JVM ends up competing with the OS for memory, forcing the mapped files
> to be paged out.
>
> So firstly, I’d make sure you aren’t setting the JVM heap to use too much
> of your available memory.  Leave some headroom between the JVM heap and
> the container limit so the OS has room for the memory-mapped files.  Since
> you mention you have Grafana in place, I’d also look at any metrics
> available around memory-mapped file usage, paging, etc. to see whether
> this is your problem.
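>
> For example (a rough sketch; paths and numbers are placeholders, and I’m
> assuming you start Fuseki via the standard fuseki-server script, which
> honours the JVM_ARGS environment variable): on the 12 GiB pod you might
> cap the heap well below the container limit and leave the rest to the OS
> page cache for the mapped files:
>
>   # hypothetical sizing: ~4 GiB heap, ~8 GiB left for OS/mapped files
>   export JVM_ARGS="-Xmx4g -Xms4g"
>   fuseki-server --loc=/fuseki/databases/ds /ds
>
> The exact split depends on your workload; the point is that the heap
> should not consume most of the container’s memory.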
>
> > The second query is memory-intensive
>
> Yes, operators like DISTINCT that require the query engine to keep large
> chunks of the data in memory are always going to be memory-intensive.  The
> Jena query engine is generally designed to stream and compute results
> lazily as much as possible.  If you have control over the queries being
> issued, I would look at whether you actually need operators like DISTINCT
> at all.
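>
> As an illustration (an untested sketch), your second query could avoid the
> string building entirely by taking DISTINCT over the ?s/?p pairs
> themselves, so the engine compares nodes rather than concatenated strings:
>
>   SELECT (COUNT(*) AS ?c)
>   WHERE {
>     SELECT DISTINCT ?s ?p
>     WHERE { GRAPH ?g { ?s ?p ?o } }
>   }
>
> This should give the same count (CONCAT can in principle collide across
> different pairs, so the pair version is arguably more correct) while being
> cheaper to evaluate.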
>
> > TDB optimizer, but that isn’t an option with our number of datasets
> > and graphs, as far as we can tell
>
> I don’t really follow this statement.  I assume you’re referring to the
> optional stats-based optimizer?  Unless your datasets are frequently
> updated, I don’t see why you wouldn’t gain some value from generating
> stats for each dataset.  Remember that the TDB optimizer works on a
> per-dataset basis: you can generate a stats file for each dataset, or for
> some subset of your datasets, and place each file in the relevant database
> directory; the stats files don’t interact with each other.
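>
> For instance (a sketch; the paths are placeholders), generating stats for
> one TDB2 database looks roughly like this, ideally run while the database
> is not being updated:
>
>   tdb2.tdbstats --loc=/fuseki/databases/ds > /tmp/stats.opt
>   # for TDB2 the stats file goes inside the current Data-NNNN directory
>   cp /tmp/stats.opt /fuseki/databases/ds/Data-0001/stats.opt
>
> Repeat per dataset (or per subset) as needed.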
>
> > In production, we have multiple datasets consisting of millions of
> > triples, and our end goal is to improve query times for our users.
>
> Often the best way to improve query times for users is either to exert
> more control over the queries (if the queries aren’t end-user controlled),
> using tools like Jena’s qparse [2] to analyse your queries and experiment
> with modifications that might optimise better, or, if you permit arbitrary
> queries, to educate/train your users on best practices: how to write
> better queries, SPARQL optimisation, etc.
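>
> For example, something like the following (assuming the query is saved in
> query.rq) prints the query algebra, so you can see what will actually be
> executed and compare it before and after a rewrite:
>
>   qparse --print=op --query=query.rq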
>
> Another thing to consider: if you aren’t doing federated queries across
> your multiple datasets, you might actually be better off running
> independent, smaller Fuseki instances on smaller AWS nodes, each serving a
> separate dataset.  This would give you more flexibility to tune the
> resources, JVM heap, etc. for each dataset according to its needs.
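>
> As a sketch (names and paths are placeholders), each node would then just
> run its own server over a single database:
>
>   fuseki-server --loc=/fuseki/databases/datasetA /datasetA
>
> with the heap and container limits tuned to that one dataset.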
>
> Hope this helps,
>
> Rob
>
>
> [1] https://jena.apache.org/documentation/tdb/faqs.html#java-heap
> [2] https://jena.apache.org/documentation/query/explain.html
>
> From: Vince Wouters via users <[email protected]>
> Date: Thursday, 15 January 2026 at 12:25
> To: [email protected] <[email protected]>
> Cc: Vince Wouters <[email protected]>
> Subject: Increasing Apache Jena Performance
>
> Hello Jena community,
>
> We’re looking for guidance on what other avenues are worth exploring to
> improve overall query performance on our Apache Jena Fuseki instance, which
> consists of multiple datasets, each containing millions of triples.
>
>
> *Setup*
>
>    - Apache Jena Fuseki *5.5.0*
>    - TDB2-backed datasets
>    - Running on *AWS EKS (Kubernetes)*
>    - Dataset size: ~15.6 million triples
>
>
> *Infrastructure*
>
>    - Instances tested:
>       - *c5a.2xlarge* (16 GiB instance, 12 GiB pod memory)
>       - *c5a.4xlarge* (32 GiB instance, 28 GiB pod memory)
>    - JVM memory derived from container limits
>    - Grafana metrics show no storage bottleneck (IOPS and throughput remain
>    well within limits)
>
> *Test Queries*
> SELECT (COUNT(DISTINCT ?s) AS ?sCount) WHERE { GRAPH ?g { ?s ?p ?o } }
>
> Takes around 80 seconds for our dataset.
>
> SELECT (COUNT(DISTINCT CONCAT(STR(?s), STR(?p))) AS ?c)
> WHERE { GRAPH ?g { ?s ?p ?o } }
>
> Takes around 120 seconds for our dataset.
>
> *What we’ve observed*
>
>    - The first query is stable once a minimum heap is available.
>    - The second query is memory-intensive:
>       - On the smaller instance, it will time out once available heap drops
>       below a certain threshold.
>       - On the larger instance we see clear improvements, but not linear
>       scaling.
>    - Increasing heap helps to a point, but does not feel like the full
>    solution.
>
>
> *Other things we’ve tried*
>
>    - TDB optimizer, but that isn’t an option with our number of datasets
>    and graphs, as far as we can tell.
>
> *Question*
> Given this type of workload and dataset size, what other routes should we
> consider to improve performance, beyond simply adjusting heap size?
>
> In production, we have multiple datasets consisting of millions of triples,
> and our end goal is to improve query times for our users.
>
> Any guidance or pointers would be much appreciated.
>
> Thanks in advance.
>
