Hello Jena community,

We’re looking for guidance on how to improve overall query performance on
our Apache Jena Fuseki instance, which hosts multiple datasets, each
containing millions of triples.


*Setup*

   - Apache Jena Fuseki *5.5.0*
   - TDB2-backed datasets
   - Running on *AWS EKS (Kubernetes)*
   - Dataset size: ~15.6 million triples


*Infrastructure*

   - Instances tested:
      - *c5a.2xlarge* (16 GiB instance, 12 GiB pod memory)
      - *c5a.4xlarge* (32 GiB instance, 28 GiB pod memory)
   - JVM memory derived from container limits
   - Grafana metrics show no storage bottleneck (IOPS and throughput remain
   well within limits)

*Test Queries*
SELECT (COUNT(DISTINCT ?s) AS ?sCount) WHERE { GRAPH ?g { ?s ?p ?o } }

This takes around 80 seconds on our dataset.

SELECT (COUNT(DISTINCT CONCAT(STR(?s), STR(?p))) AS ?c)
WHERE { GRAPH ?g { ?s ?p ?o } }

This takes around 120 seconds on our dataset.
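
For anyone who wants to reproduce timings like these, here is a minimal
sketch of a timing harness that sends both test queries over the standard
SPARQL protocol. The endpoint URL (including the dataset name "ds"), the
300-second timeout, and the helper names build_request and time_query are
assumptions made for illustration; adjust them to your own deployment. It
measures wall-clock time from the client side, so it includes network and
result serialization overhead.

```python
import time
import urllib.parse
import urllib.request

# Hypothetical endpoint: replace with the real Fuseki dataset query URL.
ENDPOINT = "http://localhost:3030/ds/sparql"

# The two test queries from this post.
QUERIES = {
    "distinct_subjects": (
        "SELECT (COUNT(DISTINCT ?s) AS ?sCount) "
        "WHERE { GRAPH ?g { ?s ?p ?o } }"
    ),
    "distinct_subject_predicate": (
        "SELECT (COUNT(DISTINCT CONCAT(STR(?s), STR(?p))) AS ?c) "
        "WHERE { GRAPH ?g { ?s ?p ?o } }"
    ),
}


def build_request(endpoint, query):
    """Build a SPARQL protocol POST request with a form-encoded query."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


def time_query(endpoint, query, timeout=300):
    """Run one query and return (elapsed_seconds, response_bytes)."""
    request = build_request(endpoint, query)
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = response.read()
    return time.perf_counter() - start, payload


if __name__ == "__main__":
    for name, query in QUERIES.items():
        elapsed, _ = time_query(ENDPOINT, query)
        print(f"{name}: {elapsed:.1f} s")
```

Running each query several times and discarding the first run helps
separate cold-cache from warm-cache behaviour.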

*What we’ve observed*

   - The first query is stable once a minimum heap is available.
   - The second query is memory-intensive:
      - On the smaller instance, it will time out once available heap drops
      below a certain threshold.
      - On the larger instance we see clear improvements, but not linear
      scaling.
   - Increasing heap helps to a point, but does not feel like the full
   solution.


*Other things we’ve tried*

   - The TDB optimizer, but as far as we can tell it isn’t a practical
   option given our number of datasets and graphs.

*Question*
Given this type of workload and dataset size, what other routes should we
consider to improve performance, beyond simply adjusting heap size?

In production, we have multiple datasets consisting of millions of triples,
and our end goal is to improve query times for our users.

Any guidance or pointers would be much appreciated.

Thanks in advance.
