On 11/03/2024 14:35, Gaspar Bartalus wrote:
Hi Andy,

On Fri, Mar 8, 2024 at 4:41 PM Andy Seaborne <[email protected]> wrote:



On 08/03/2024 10:40, Gaspar Bartalus wrote:
Hi,

Thanks for the responses.

We were actually curious whether you'd have an explanation for the
linear increase in storage, and why we are seeing differences between
the actual size of our dataset and the size it uses on disk (i.e.
between what `df -h` and `du -lh` report)?

Linear increase between compactions or across compactions? The latter
sounds like the previous version hasn't been deleted.


Across compactions, increasing linearly over several days, with compactions
running every day. The compaction is used with the "deleteOld" parameter,
and there is only one Data- folder in the volume, so I assume compaction
itself works as expected.

Strange - I can't explain that. Could you check that there is only one Data-NNNN directory inside the database directory?
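
Something like the following would show it (the path is only a guess -
use wherever the TDB2 database lives on your volume):

    # path below is an assumption - adjust to your mounted volume
    # list the generations and how much disk each one actually uses
    ls -ld /fuseki/databases/ds/Data-*
    du -sh /fuseki/databases/ds/Data-*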

What's the disk storage setup? e.g. filesystem type.

    Andy

TDB uses sparse files. It allocates 8M chunks per index but that isn't
used immediately. Sparse files are reported differently by different
tools and also differently by different operating systems. I don't know
how k3s is managing the storage.

Sometimes it's the size of the file, sometimes it's the amount of space
in use. For small databases, there is quite a difference.

An empty database is around 220 kbytes on disk, but you'll see many
8 Mbyte files with "ls -l".

If you zip the database up and unpack it, the unpacked copy takes
193 Mbytes, because unpacking writes the sparse regions out as real blocks.
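
You can see the two views side by side on one generation directory
(Data-0001 is just an example name; GNU du assumed):

    # allocated blocks vs apparent (sparse) file length
    du -sh Data-0001/                     # space actually in use
    du -sh --apparent-size Data-0001/     # sparse length, matches "ls -l"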

After a compaction, the previous version of the storage can be deleted.
Only the highest-numbered "Data-..." directory is used; a previous one
can be zipped up for backup.
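
If the nightly job goes through the Fuseki admin API, compacting and
removing the old generation in one go looks roughly like this ("ds" and
localhost:3030 are placeholders for your dataset name and endpoint):

    # compact the dataset and delete the previous Data-NNNN directory
    curl -X POST 'http://localhost:3030/$/compact/ds?deleteOld=true'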

The heap memory has some very small saw-tooth peaks, but otherwise it's
flat.

At what amount of memory?


At ~7GB.



Regards,
Gaspar

On Thu, Mar 7, 2024 at 11:55 PM Andy Seaborne <[email protected]> wrote:



On 07/03/2024 13:24, Gaspar Bartalus wrote:
Dear Jena support team,

We would like to ask you to help us in configuring the memory for our
jena-fuseki instance running in kubernetes.

*We have the following setup:*

* Jena-fuseki deployed as StatefulSet to a k8s cluster with the
resource config:

Limits:
    cpu:     2
    memory:  16Gi
Requests:
    cpu:     100m
    memory:  11Gi

* The JVM_ARGS has the following value: -Xmx10G

* Our main dataset of type TDB2 contains ~1 million triples.

A million triples doesn't take up much RAM even in a memory dataset.

In Java, the JVM will grow until it is close to the -Xmx figure. A major
GC will then free up a lot of memory. But the JVM does not give the
memory back to the kernel.

TDB2 does not only use heap space. A heap of 2-4G is usually enough per
dataset, sometimes less (data shape dependent - e.g. many large
literals use more space).

Use a profiler to examine the heap in use; you'll probably see a
saw-tooth shape.
Force a GC and see the level of in-use memory afterwards.
Add some safety margin and work space for requests and try that as the
heap size.
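
If attaching a profiler to the pod is awkward, jcmd against the running
JVM gives a rough version of the same picture (the pid is assumed to be
1 inside the container - adjust if Fuseki runs under a different pid):

    # force a full GC, then read the heap actually in use afterwards
    jcmd 1 GC.run
    jcmd 1 GC.heap_info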

*  We execute the following types of UPDATE operations:
     - There are triggers in the system (e.g. users of the application
changing the data) which start ~50 other update operations containing
up to ~30K triples. Most of them run in parallel; some are delayed
by seconds or minutes.
     - There are scheduled UPDATE operations (executed on an hourly basis)
containing 30K-500K triples.
     - These UPDATE operations usually delete and insert the same number
of triples in the dataset. We use the compact API as a nightly job.

*We are noticing the following behaviour:*

* Fuseki consumes 5-10G of heap memory continuously, as configured in
the JVM_ARGS.

* There are points in time when the volume usage of the k8s container
starts to increase suddenly. This does not drop even though compaction
is successfully executed and the dataset size (triple count) does not
increase. See attachment below.

*Our suspicions:*

* garbage collection in Java is often delayed; memory is not freed as
quickly as we would expect, and the heap limit is reached quickly
if multiple parallel queries are run
* long-running database queries can promote objects to the old
generation (Gen2), which is not actively cleaned by the garbage collector
* memory-mapped files are also garbage-collected (and perhaps they
could go to Gen2 as well, using more and more storage space).

Could you please explain the possible reasons behind such a behaviour?
And finally could you please suggest a more appropriate configuration
for our use case?

Thanks in advance and best wishes,
Gaspar Bartalus




