On Mon, Mar 11, 2024 at 6:28 PM Andy Seaborne <[email protected]> wrote:
> On 11/03/2024 14:35, Gaspar Bartalus wrote:
> > Hi Andy,
> >
> > On Fri, Mar 8, 2024 at 4:41 PM Andy Seaborne <[email protected]> wrote:
> >
> >> On 08/03/2024 10:40, Gaspar Bartalus wrote:
> >>> Hi,
> >>>
> >>> Thanks for the responses.
> >>>
> >>> We were actually curious whether you'd have some explanation for the
> >>> linear increase in storage, and why we are seeing differences between
> >>> the actual size of our dataset and the size it uses on disk
> >>> (differences between `df -h` and `du -lh`)?
> >>
> >> Linear increase between compactions or across compactions? The latter
> >> sounds like the previous version hasn't been deleted.
> >>
> > Across compactions, increasing linearly over several days, with
> > compactions running every day. The compaction is used with the
> > "deleteOld" parameter, and there is only one Data- folder in the volume,
> > so I assume compaction itself works as expected.
>
> Strange - I can't explain that. Could you check that there is only one
> Data-NNNN directory inside the database directory?

Yes, there is surely just one Data-NNNN folder in the database directory.

> What's the disk storage setup? e.g. filesystem type.

We have an Azure disk of type Standard SSD LRS with a filesystem of type
Ext4.

>     Andy
>
> >> TDB uses sparse files. It allocates 8M chunks per index, but that isn't
> >> used immediately. Sparse files are reported differently by different
> >> tools and also differently by different operating systems. I don't know
> >> how k3s is managing the storage.
> >>
> >> Sometimes it's the size of the file, sometimes it's the amount of space
> >> in use. For small databases, there is quite a difference.
> >>
> >> An empty database is around 220 kbytes, but you'll see many 8 Mbyte
> >> files with "ls -l".
> >>
> >> If you zip the database up and unpack it, then it's 193 Mbytes.
> >>
> >> After a compaction, the previous version of storage can be deleted.
> >> Of the "Data-..." directories, only the highest-numbered one is used.
> >> A previous one can be zipped up for backup.
> >>
> >>> The heap memory has some very minimal peaks, saw-tooth, but otherwise
> >>> it's flat.
> >>
> >> At what amount of memory?
> >>
> > At ~7GB.
> >
> >>> Regards,
> >>> Gaspar
> >>>
> >>> On Thu, Mar 7, 2024 at 11:55 PM Andy Seaborne <[email protected]> wrote:
> >>>
> >>>> On 07/03/2024 13:24, Gaspar Bartalus wrote:
> >>>>> Dear Jena support team,
> >>>>>
> >>>>> We would like to ask you to help us configure the memory for our
> >>>>> jena-fuseki instance running in Kubernetes.
> >>>>>
> >>>>> *We have the following setup:*
> >>>>>
> >>>>> * Jena-fuseki deployed as a StatefulSet to a k8s cluster with the
> >>>>> resource config:
> >>>>>
> >>>>>   Limits:
> >>>>>     cpu: 2
> >>>>>     memory: 16Gi
> >>>>>   Requests:
> >>>>>     cpu: 100m
> >>>>>     memory: 11Gi
> >>>>>
> >>>>> * The JVM_ARGS has the following value: -Xmx10G
> >>>>>
> >>>>> * Our main dataset of type TDB2 contains ~1 million triples.
> >>>>
> >>>> A million triples doesn't take up much RAM, even in a memory dataset.
> >>>>
> >>>> In Java, the JVM will grow until it is close to the -Xmx figure. A
> >>>> major GC will then free up a lot of memory, but the JVM does not give
> >>>> the memory back to the kernel.
> >>>>
> >>>> TDB2 does not only use heap space. A heap of 2-4G is usually enough
> >>>> per dataset, sometimes less (data shape dependent - e.g. many large
> >>>> literals use more space).
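
As a minimal sketch of what that sizing advice could look like in
practice - assuming the stock fuseki-server startup script, which reads
the JVM_ARGS environment variable; the heap values, the --loc path and
the /ds dataset name here are illustrative, not recommendations:

    # Modest fixed heap; grow it only if profiling shows sustained
    # pressure. TDB2 file access happens largely outside the Java heap.
    export JVM_ARGS="-Xms2G -Xmx4G"
    ./fuseki-server --tdb2 --loc=/fuseki/databases/ds /ds
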
> >>>> Use a profiler to examine the heap in use; you'll probably see a
> >>>> saw-tooth shape.
> >>>> Force a GC and see the level of in-use memory afterwards.
> >>>> Add some safety margin and work space for requests, and try that as
> >>>> the heap size.
> >>>>
> >>>>> * We execute the following types of UPDATE operations:
> >>>>>     - There are triggers in the system (e.g. users of the
> >>>>> application changing the data) which start ~50 other update
> >>>>> operations containing up to ~30K triples. Most of them run in
> >>>>> parallel; some are delayed by seconds or minutes.
> >>>>>     - There are scheduled UPDATE operations (executed on an hourly
> >>>>> basis) containing 30K-500K triples.
> >>>>>     - These UPDATE operations usually delete and insert the same
> >>>>> amount of triples in the dataset. We use the compact API as a
> >>>>> nightly job.
> >>>>>
> >>>>> *We are noticing the following behaviour:*
> >>>>>
> >>>>> * Fuseki consumes 5-10G of heap memory continuously, as configured
> >>>>> in the JVM_ARGS.
> >>>>>
> >>>>> * There are points in time when the volume usage of the k8s
> >>>>> container starts to increase suddenly. This does not drop even
> >>>>> though compaction is successfully executed and the dataset size
> >>>>> (triple count) does not increase. See attachment below.
> >>>>>
> >>>>> *Our suspicions:*
> >>>>>
> >>>>> * Garbage collection in Java is often delayed; memory is not freed
> >>>>> as quickly as we would expect, and the heap limit is reached
> >>>>> quickly if multiple parallel queries are run.
> >>>>> * Long-running database queries can promote regular memory to Gen2,
> >>>>> which is not actively cleaned by the garbage collector.
> >>>>> * Memory-mapped files are also garbage-collected (and perhaps they
> >>>>> could go to Gen2 as well, using more and more storage space).
> >>>>>
> >>>>> Could you please explain the possible reasons behind such behaviour?
> >>>>> And finally, could you please suggest a more appropriate
> >>>>> configuration for our use case?
> >>>>>
> >>>>> Thanks in advance and best wishes,
> >>>>> Gaspar Bartalus
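
On the "force a GC and see the level of in-use memory" suggestion above:
a minimal way to do this without a full profiler, assuming a JDK 11+
image that ships the jcmd tool (inside a k8s pod the Java process is
often PID 1; adjust the PID as needed):

    jcmd 1 GC.run          # request a full collection
    jcmd 1 GC.heap_info    # report heap usage after the collection

The in-use level reported after the forced GC, plus some working-space
margin for parallel updates, is the figure to try as -Xmx.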

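Two further checks that may help pin down the df/du discrepancy discussed
above - assuming GNU coreutils and the standard Fuseki admin endpoint; the
/fuseki/databases/ds path and the ds dataset name are examples:

    # Allocated blocks vs. logical size: a large gap means sparse files,
    # not lost space.
    du -sh /fuseki/databases/ds
    du -sh --apparent-size /fuseki/databases/ds

    # Compact and delete the previous Data-NNNN generation in one step.
    curl -XPOST 'http://localhost:3030/$/compact/ds?deleteOld=true'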