Hi Martynas,

Java itself is not the issue, especially modern Java which can use off-heap memory easily (see Project Panama).
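For instance, with the Foreign Function & Memory API (finalised in Java 22), off-heap allocation is a few lines. A minimal sketch (class name and sizes are illustrative, not from any real codebase):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class OffHeapDemo {
    public static void main(String[] args) {
        // A confined arena allocates native (off-heap) memory and frees it
        // deterministically when the arena is closed.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(1024); // 1 KiB off the Java heap
            seg.set(ValueLayout.JAVA_LONG, 0, 42L);
            System.out.println(seg.get(ValueLayout.JAVA_LONG, 0));
        }
    }
}
```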

You can even beat the amazingly fast DuckDB in *pure Java* if you use all its modern capabilities: https://link.springer.com/article/10.1007/s00778-023-00784-2

Eclipse RDF4J has an LMDB-backed store that has seen a lot of performance improvements in recent months. Only LMDB itself is in C (similar to RocksDB); everything else is in Java. It's much faster than TDB2 in our use cases – you may want to give it a try. The RDF4J query optimizer is not as good as Jena's, however.

TDB3 is the right direction. I personally think that RocksDB is a much better library (in almost every aspect) than LMDB.

From what I understand, Stardog has much more than just RocksDB written in C++, including the triple indexing structure (this is what I remember from their paper on BARQ). That is, however, much more difficult to maintain.

Piotr Sowiński
NeverBlink
On 1/15/26 15:00, Martynas Jusevičius wrote:
Hey Rob,

I've changed the subject so as to not derail the memory optimization thread.

Given how many times this topic has come up on this list, and how often
Jena is struggling with the types of queries and the sizes of datasets that
it should be able to handle in theory, maybe the problem is the Java-based
architecture of TDB?

Maybe it requires a new type of persistence backend such as RocksDB? We
know there is some prototype of this: https://github.com/afs/TDB3
We also know that Stardog is using RocksDB for storage:
https://docs.stardog.com/operating-stardog/database-administration/storage-optimize

IMO the lack of scalable open-source triplestores is one of the main pain
points in the RDF ecosystem.
I love Jena and Fuseki and I'm using it as the default triplestore in my
projects, but I have doubts whether I could use it in a high-load
production system.

I also know this is an open-source project with limited resources, but that
is a different topic.

Martynas
atomgraph.com

On Thu, Jan 15, 2026 at 2:34 PM Rob @ DNR <[email protected]> wrote:

Hi Vince

JVM memory derived from container limits
What do you mean by this specifically?

As has been discussed and referenced previously on this list, and is noted
in our TDB FAQs [1], much of the memory usage for TDB databases is
off-heap, via memory-mapped files.

Therefore, setting the JVM heap too high can actually reduce performance,
as the JVM then competes with the OS for memory and forces the mapped
files to be paged out.

So firstly, I’d make sure you aren’t setting the JVM heap to use too much
of your available memory.  Leave some headroom between the JVM heap and
the container limit so the OS has room for the memory-mapped files.  Since
you mention you have Grafana in place, I’d also look at any metrics that
might be available around memory-mapped file usage/paging etc. to see if
this might be your problem.
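As a sketch of the sizing (the 4g figure and paths are placeholders for your deployment, not recommendations): on a 16 GiB pod, a modest heap plus a large remainder for the OS page cache often works better than a near-limit heap.

```shell
# Hypothetical sizing for a 16 GiB pod running Fuseki with TDB2:
# keep the JVM heap modest and leave the rest for memory-mapped files.
# JVM_ARGS is the variable the Fuseki launch script picks up.
export JVM_ARGS="-Xmx4g"
/opt/fuseki/fuseki-server --loc=/data/tdb2 /ds
```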

The second query is memory-intensive
Yes, operators like DISTINCT that require the query engine to keep large
chunks of the data in-memory are always going to be memory-intensive.  The
Jena query engine is generally designed for lazy streaming and calculation
of results as much as possible.  If you have control over queries being
issued then I would look at whether you actually need to be using operators
like DISTINCT in your queries.
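If the DISTINCT itself is unavoidable, the shape of the deduplication key can still matter. As one hedged variant of the second test query from your message (no guarantee it optimises better in your setup), deduplicating the (?s, ?p) pairs directly avoids building a concatenated string per row:

SELECT (COUNT(*) AS ?c) WHERE {
  SELECT DISTINCT ?s ?p WHERE { GRAPH ?g { ?s ?p ?o } }
}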

TDB optimizer, but that isn’t an option with our number of datasets
    and graphs, as far as we can tell

Don’t really follow this statement.  I assume you’re referring to the
optional stats based optimiser?  Unless your datasets are being frequently
updated, I don’t see why you wouldn’t gain some value from generating the
stats for each dataset.  Remember that the TDB optimizer works on a
per-dataset basis, so you can generate stats files for each dataset, or
some subset of your datasets, placing each stats file into the relevant
database directory; they don’t interact with each other.
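As a sketch (paths are placeholders; this assumes Jena’s command-line tools are on your PATH and the database is not being written to while stats are generated):

```shell
# Generate stats into a temporary file first, then move it into place.
tdb2.tdbstats --loc=/data/tdb2 > /tmp/stats.opt
# For TDB2, the optimizer looks for stats.opt inside the current
# Data-NNNN sub-directory of the database.
mv /tmp/stats.opt /data/tdb2/Data-0001/stats.opt
```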

In production, we have multiple datasets consisting of millions of
triples,
and our end goal is to improve query times for our users.

Often the best way to improve query times for users is either to exert
more control over the queries (if they aren’t end-user controlled), using
tools like Jena’s qparse [2] to analyse your queries and experiment with
modifications that might optimise better, or, if you permit arbitrary
queries, to educate/train your users on best practices: how to write
better queries, SPARQL optimisation, etc.
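As a concrete sketch (assuming the Jena command-line tools are on your PATH and the query text is saved in a local file query.rq), you can print the algebra a query compiles to and see where DISTINCT and the graph pattern end up:

```shell
qparse --print=op --query query.rq
```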

Another thing to consider: if you aren’t doing federated queries across
your multiple datasets, you might actually be better off running
independent, smaller instances of Fuseki on smaller AWS nodes, each
serving a separate dataset.  This would give you more flexibility to tune
the resources, JVM heap, etc. for each dataset depending on its needs.

Hope this helps,

Rob


[1] https://jena.apache.org/documentation/tdb/faqs.html#java-heap
[2] https://jena.apache.org/documentation/query/explain.html

From: Vince Wouters via users <[email protected]>
Date: Thursday, 15 January 2026 at 12:25
To: [email protected] <[email protected]>
Cc: Vince Wouters <[email protected]>
Subject: Increasing Apache Jena Performance

Hello Jena community,

We’re looking for guidance on what other avenues are worth exploring to
improve overall query performance on our Apache Jena Fuseki instance, which
consists of multiple datasets, each containing millions of triples.


*Setup*

    - Apache Jena Fuseki *5.5.0*
    - TDB2-backed datasets
    - Running on *AWS EKS (Kubernetes)*
    - Dataset size: ~15.6 million triples


*Infrastructure*

    - Instances tested:
       - *c5a.2xlarge* (16 GiB instance, 12 GiB pod memory)
       - *c5a.4xlarge* (32 GiB instance, 28 GiB pod memory)
    - JVM memory derived from container limits
    - Grafana metrics show no storage bottleneck (IOPS and throughput remain
    well within limits)

*Test Queries*
SELECT (COUNT(DISTINCT ?s) AS ?sCount) WHERE { GRAPH ?g { ?s ?p ?o } }

Takes around 80 seconds for our dataset.

SELECT (COUNT(DISTINCT CONCAT(STR(?s), STR(?p))) AS ?c)
WHERE { GRAPH ?g { ?s ?p ?o } }

Takes around 120 seconds for our dataset.

*What we’ve observed*

    - The first query is stable once a minimum heap is available.
    - The second query is memory-intensive:
       - On the smaller instance, it will time out once available heap drops
       below a certain threshold.
       - On the larger instance we see clear improvements, but not linear
       scaling.
    - Increasing heap helps to a point, but does not feel like the full
    solution.


*Other things we’ve tried*

    - TDB optimizer, but that isn’t an option with our number of datasets
    and graphs, as far as we can tell.

*Question*
Given this type of workload and dataset size, what other routes should we
consider to improve performance, beyond simply adjusting heap size?

In production, we have multiple datasets consisting of millions of triples,
and our end goal is to improve query times for our users.

Any guidance or pointers would be much appreciated.

Thanks in advance.
