Hi Andy,

On Fri, Mar 8, 2024 at 4:41 PM Andy Seaborne <[email protected]> wrote:

> On 08/03/2024 10:40, Gaspar Bartalus wrote:
> > Hi,
> >
> > Thanks for the responses.
> >
> > We were actually curious if you'd have some explanation for the
> > linear increase in the storage, and why we are seeing differences
> > between the actual size of our dataset and the size it uses on disk
> > (changes between `df -h` and `du -lh`)?
>
> Linear increase between compactions or across compactions? The latter
> sounds like the previous version hasn't been deleted.

Across compactions: the storage use increases linearly over several days,
with compactions running every day. The compaction is called with the
"deleteOld" parameter, and there is only one Data- folder in the volume,
so I assume compaction itself works as expected.
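For reference, the nightly job is just a call to the Fuseki admin compact
endpoint with deleteOld; roughly like the following (host and dataset name
are placeholders here, not our actual values):

    # Compact the TDB2 dataset and remove the superseded Data-NNNN
    # directory once the new one is in place (placeholder host/dataset).
    curl -X POST 'http://localhost:3030/$/compact/ds?deleteOld=true'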
> TDB uses sparse files. It allocates 8M chunks per index but that isn't
> used immediately. Sparse files are reported differently by different
> tools and also differently by different operating systems. I don't know
> how k3s is managing the storage.
>
> Sometimes it's the size of the file, sometimes it's the amount of space
> in use. For small databases, there is quite a difference.
>
> An empty database is around 220kbytes but you'll see many 8Mbyte files
> with "ls -l".
>
> If you zip the database up, and unpack it, then it's 193Mbytes.
>
> After a compaction, the previous version of storage can be deleted. The
> directory "Data-..." - only the highest numbered directory is used. A
> previous one can be zipped up for backup.
>
> > The heap memory has some very minimal peaks, saw-tooth, but otherwise
> > it's flat.
>
> At what amount of memory?

At ~7GB.
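Next time it peaks we can also force a full GC and read the post-GC level,
as suggested in your earlier mail; a minimal sketch with standard JDK
tooling (the PID lookup is illustrative):

    jcmd | grep -i fuseki       # find the PID of the Fuseki JVM
    jcmd <pid> GC.run           # request a full GC
    jcmd <pid> GC.heap_info     # heap in use after the collection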
> > Regards,
> > Gaspar
> >
> > On Thu, Mar 7, 2024 at 11:55 PM Andy Seaborne <[email protected]> wrote:
> >
> >> On 07/03/2024 13:24, Gaspar Bartalus wrote:
> >>> Dear Jena support team,
> >>>
> >>> We would like to ask you to help us in configuring the memory for our
> >>> jena-fuseki instance running in kubernetes.
> >>>
> >>> *We have the following setup:*
> >>>
> >>> * Jena-fuseki deployed as StatefulSet to a k8s cluster with the
> >>>   resource config:
> >>>
> >>>     Limits:
> >>>       cpu: 2
> >>>       memory: 16Gi
> >>>     Requests:
> >>>       cpu: 100m
> >>>       memory: 11Gi
> >>>
> >>> * The JVM_ARGS has the following value: -Xmx10G
> >>>
> >>> * Our main dataset of type TDB2 contains ~1 million triples.
> >>
> >> A million triples doesn't take up much RAM even in a memory dataset.
> >>
> >> In Java, the JVM will grow until it is close to the -Xmx figure. A major
> >> GC will then free up a lot of memory. But the JVM does not give the
> >> memory back to the kernel.
> >>
> >> TDB2 does not only use heap space. A heap of 2-4G is usually enough per
> >> dataset, sometimes less (data shape dependent - e.g. many large
> >> literals use more space).
> >>
> >> Use a profiler to examine the heap in-use, you'll probably see a
> >> saw-tooth shape.
> >> Force a GC and see the level of in-use memory afterwards.
> >> Add some safety margin and work space for requests and try that as the
> >> heap size.
> >>
> >>> * We execute the following type of UPDATE operations:
> >>>   - There are triggers in the system (e.g. users of the application
> >>>     changing the data) which start ~50 other update operations
> >>>     containing up to ~30K triples. Most of them run in parallel, some
> >>>     are delayed by seconds or minutes.
> >>>   - There are scheduled UPDATE operations (executed on an hourly basis)
> >>>     containing 30K-500K triples.
> >>>   - These UPDATE operations usually delete and insert the same amount
> >>>     of triples in the dataset.

We use the compact API as a nightly job.

> >>> *We are noticing the following behaviour:*
> >>>
> >>> * Fuseki consumes 5-10G of heap memory continuously, as configured in
> >>>   the JVM_ARGS.
> >>>
> >>> * There are points in time when the volume usage of the k8s container
> >>>   starts to increase suddenly. This does not drop even though compaction
> >>>   is successfully executed and the dataset size (triple count) does not
> >>>   increase. See attachment below.
> >>>
> >>> *Our suspicions:*
> >>>
> >>> * Garbage collection in Java is often delayed; memory is not freed as
> >>>   quickly as we would expect, and the heap limit is reached quickly
> >>>   if multiple parallel queries are run.
> >>> * Long-running database queries can promote memory to the old
> >>>   generation (Gen2), which is not actively cleaned by the garbage
> >>>   collector.
> >>> * Memory-mapped files are also garbage-collected (and perhaps they
> >>>   could go to Gen2 as well, using more and more storage space).
> >>>
> >>> Could you please explain the possible reasons behind such behaviour?
> >>> And finally, could you please suggest a more appropriate configuration
> >>> for our use case?
> >>>
> >>> Thanks in advance and best wishes,
> >>> Gaspar Bartalus
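P.S. On the `df -h` vs `du -lh` difference: this is roughly how we compare
the apparent size of the data directory with what is actually allocated on
the volume (GNU coreutils assumed; the paths are placeholders for our
actual layout):

    du -sh --apparent-size /fuseki/databases/ds   # logical size, sparse regions counted
    du -sh /fuseki/databases/ds                   # blocks actually allocated on disk
    ls -ls /fuseki/databases/ds/Data-0001         # first column: allocated blocks per file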
