On Mon, Mar 11, 2024 at 6:28 PM Andy Seaborne <[email protected]> wrote:

>
>
> On 11/03/2024 14:35, Gaspar Bartalus wrote:
> > Hi Andy,
> >
> > On Fri, Mar 8, 2024 at 4:41 PM Andy Seaborne <[email protected]> wrote:
> >
> >>
> >>
> >> On 08/03/2024 10:40, Gaspar Bartalus wrote:
> >>> Hi,
> >>>
> >>> Thanks for the responses.
> >>>
> > We were actually curious whether you'd have an explanation for the
> > linear increase in storage, and why we are seeing differences between
> > the actual size of our dataset and the size it uses on disk (i.e. the
> > discrepancy between what `df -h` and `du -lh` report)?
> >>
> >> Linear increase between compactions or across compactions? The latter
> >> sounds like the previous version hasn't been deleted.
> >>
> >
> > Across compactions, increasing linearly over several days, with
> > compactions running every day. Compaction is run with the "deleteOld"
> > parameter, and there is only one Data- folder in the volume, so I assume
> > compaction itself works as expected.
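
For context, a compaction request with "deleteOld" against the Fuseki admin
API looks roughly like this (host and dataset name below are placeholders):

    # compact the dataset and remove the superseded Data-NNNN directory
    curl -XPOST 'http://localhost:3030/$/compact/ds?deleteOld=true'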
>
> Strange - I can't explain that. Could you check that there is only one
> Data-NNNN directory inside the database directory?
>

Yes, there is definitely just one Data-NNNN folder in the database directory.
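
A simple way to verify this is something like the following (the database
path below is just an example):

    # list the storage generations inside the TDB2 database directory
    ls -d /fuseki/databases/ds/Data-*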

>
> What's the disk storage setup? e.g filesystem type.
>

We have an Azure disk of type Standard SSD LRS with a filesystem of type
Ext4.

>
>      Andy
>
> >> TDB uses sparse files. It allocates 8M chunks per index but that isn't
> >> used immediately. Sparse files are reported differently by different
> >> tools and also differently by different operating systems. I don't know
> >> how k3s is managing the storage.
> >>
> >> Sometimes it's the size of the file, sometimes it's the amount of space
> >> in use. For small databases, there is quite a difference.
> >>
> >> An empty database is around 220kbytes but you'll see many 8Mbyte files
> >> with "ls -l".
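
A quick way to see the sparse-file effect, assuming GNU coreutils (the
database path below is just an example):

    # apparent size vs. space actually allocated on disk
    du -sh --apparent-size /fuseki/databases/ds
    du -sh /fuseki/databases/ds
    # per-file view: the first column of "ls -ls" is the allocated blocks
    ls -ls /fuseki/databases/ds/Data-0001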
> >>
> >> If you zip the database up and unpack it, it's 193 Mbytes, because zip
> >> does not preserve sparse files.
> >>
> >> After a compaction, the previous version of the storage can be deleted.
> >> The storage versions are the "Data-..." directories - only the highest
> >> numbered directory is in use. A previous one can be zipped up for backup.
> >>
> >>> The heap memory has some very minimal peaks, saw-tooth, but otherwise
> >>> it's flat.
> >>
> >> At what amount of memory?
> >>
> >
> > At ~7GB.
> >
> >>
> >>>
> >>> Regards,
> >>> Gaspar
> >>>
> >>> On Thu, Mar 7, 2024 at 11:55 PM Andy Seaborne <[email protected]> wrote:
> >>>
> >>>>
> >>>>
> >>>> On 07/03/2024 13:24, Gaspar Bartalus wrote:
> >>>>> Dear Jena support team,
> >>>>>
> >>>>> We would like to ask for your help in configuring the memory for our
> >>>>> jena-fuseki instance running in Kubernetes.
> >>>>>
> >>>>> *We have the following setup:*
> >>>>>
> >>>>> * Jena-fuseki deployed as StatefulSet to a k8s cluster with the
> >>>>> resource config:
> >>>>>
> >>>>> Limits:
> >>>>>     cpu:     2
> >>>>>     memory:  16Gi
> >>>>> Requests:
> >>>>>     cpu:     100m
> >>>>>     memory:  11Gi
> >>>>>
> >>>>> * The JVM_ARGS has the following value: -Xmx10G
> >>>>>
> >>>>> * Our main dataset of type TDB2 contains ~1 million triples.
> >>>> A million triples doesn't take up much RAM even in a memory dataset.
> >>>>
> >>>> In Java, the JVM will grow until it is close to the -Xmx figure. A
> >>>> major GC will then free up a lot of memory. But the JVM does not give
> >>>> the memory back to the kernel.
> >>>>
> >>>> TDB2 does not only use heap space. A heap of 2-4G is usually enough
> >>>> per dataset, sometimes less (data shape dependent - e.g. many large
> >>>> literals use more space).
> >>>>
> >>>> Use a profiler to examine the heap in use; you'll probably see a
> >>>> saw-tooth shape.
> >>>> Force a GC and see the level of in-use memory afterwards.
> >>>> Add some safety margin and work space for requests and try that as the
> >>>> heap size.
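
Without a full profiler, a rough picture can be had from the standard JDK
tools (<pid> below is the Fuseki JVM process id):

    # force a full GC, then look at the live heap occupancy
    jcmd <pid> GC.run
    jcmd <pid> GC.heap_info
    # or sample heap usage every 10 seconds
    jstat -gcutil <pid> 10000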
> >>>>
> >>>>> *  We execute the following types of UPDATE operations:
> >>>>>      - There are triggers in the system (e.g. users of the application
> >>>>> changing the data) which start ~50 other update operations containing
> >>>>> up to ~30K triples. Most of them run in parallel; some are delayed
> >>>>> by seconds or minutes.
> >>>>>      - There are scheduled UPDATE operations (executed on an hourly
> >>>>> basis) containing 30K-500K triples.
> >>>>>      - These UPDATE operations usually delete and insert the same
> >>>>> amount of triples in the dataset. We use the compact API as a nightly
> >>>>> job.
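
For reference, a stripped-down sketch of such a delete-and-insert update sent
to the SPARQL update endpoint (endpoint URL, predicate and data below are
placeholders):

    curl -XPOST 'http://localhost:3030/ds/update' \
         --data-urlencode 'update=
           DELETE { ?s <http://example.org/value> ?old }
           INSERT { ?s <http://example.org/value> "new" }
           WHERE  { ?s <http://example.org/value> ?old }'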
> >>>>>
> >>>>> *We are noticing the following behaviour:*
> >>>>>
> >>>>> * Fuseki consumes 5-10G of heap memory continuously, as configured in
> >>>>> the JVM_ARGS.
> >>>>>
> >>>>> * There are points in time when the volume usage of the k8s container
> >>>>> starts to increase suddenly. This does not drop even though compaction
> >>>>> is successfully executed and the dataset size (triple count) does not
> >>>>> increase. See attachment below.
> >>>>>
> >>>>> *Our suspicions:*
> >>>>>
> >>>>> * garbage collection in Java is often delayed; memory is not freed as
> >>>>> quickly as we would expect, and the heap limit is reached quickly
> >>>>> if multiple parallel queries are run
> >>>>> * long-running database queries can promote objects to Gen2, which
> >>>>> is not actively cleaned by the garbage collector
> >>>>> * memory-mapped files are also garbage-collected (and perhaps they
> >>>>> could go to Gen2 as well, using more and more storage space).
> >>>>>
> >>>>> Could you please explain the possible reasons behind such behaviour?
> >>>>> And finally, could you please suggest a more appropriate configuration
> >>>>> for our use case?
> >>>>>
> >>>>> Thanks in advance and best wishes,
> >>>>> Gaspar Bartalus
> >>>>>
> >>>>
> >>>
> >>
> >
>
