Re: Requesting advice on Fuseki memory settings

2024-03-25 Thread Andy Seaborne




On 25/03/2024 07:05, Gaspar Bartalus wrote:

Dear Andy and co.,

Thanks for the support, I think we can close this thread for now.
We will continue to monitor this behaviour and if we can retrieve any
additional useful information then we might reopen it.


Please do pass on any information and techniques for operating
Fuseki/TDB. There is so much variety "out there" that all reports are
helpful.


Andy



Best regards,
Gaspar

On Sun, Mar 24, 2024 at 5:00 PM Andy Seaborne  wrote:




On 21/03/2024 09:52, Rob @ DNR wrote:

Gaspar

This probably relates to https://access.redhat.com/solutions/2316

Deleting a file removes it from the file table but doesn't immediately
free the space if a process is still accessing those files. That could be
something else inside the container or, in a containerised environment
where the disk space is mounted, it could potentially be host processes on
the K8S node that are monitoring the storage.

There are some suggested debugging steps in the RedHat article about ways
to figure out what processes might still be holding onto the old database
files.


Rob


Fuseki does close the database connections after compact, but only after
all read transactions on the old database have completed. That can hold
the database open for a while.

Another delay is the ext4 file system. Deletes will be in the journal,
and only when the journal operations are performed is the space actually
released. Usually this happens quickly, but I've seen it take an
appreciable length of time occasionally.

Gaspar wrote:
  > then we start fresh where du -sh and df -h return the same numbers.

This indicates the file space has been released. Restarting clears any
outstanding read transactions and likely gives the ext4 journal time to
run through.

Just about any layer (K8s, VMs) adds delay to the real release of the
space, but it should happen eventually.

  Andy


From: Gaspar Bartalus 
Date: Wednesday, 20 March 2024 at 11:41
To: users@jena.apache.org 
Subject: Re: Requesting advice on Fuseki memory settings
Hi Andy

On Sat, Mar 16, 2024 at 8:58 PM Andy Seaborne  wrote:




On 12/03/2024 13:17, Gaspar Bartalus wrote:

On Mon, Mar 11, 2024 at 6:28 PM Andy Seaborne  wrote:


On 11/03/2024 14:35, Gaspar Bartalus wrote:

Hi Andy,

On Fri, Mar 8, 2024 at 4:41 PM Andy Seaborne wrote:




On 08/03/2024 10:40, Gaspar Bartalus wrote:

Hi,

Thanks for the responses.

We were actually curious if you'd have some explanation for the
linear increase in the storage, and why we are seeing differences between
the actual size of our dataset and the size it uses on disk. (Changes
between `df -h` and `du -lh`)?

Linear increase between compactions or across compactions? The latter
sounds like the previous version hasn't been deleted.


Across compactions, increasing linearly over several days, with compactions
running every day. The compaction is used with the "deleteOld" parameter,
and there is only one Data- folder in the volume, so I assume compaction
itself works as expected.



Strange - I can't explain that. Could you check that there is only one
Data- directory inside the database directory?


Yes, there is surely just one Data- folder in the database directory.



What's the disk storage setup? e.g. filesystem type.


We have an Azure disk of type Standard SSD LRS with a filesystem of type
Ext4.


Hi Gaspar,

I still can't explain what you're seeing, I'm afraid.

Can we get some more details?

When the server has Data-N -- how big (as reported by 'du -sh') is that
directory and how big is the whole directory for the database? They
should be nearly equal.




When a compaction is done, and the server is at Data-(N+1), what are the
sizes of Data-(N+1) and the database directory?



What we see with respect to compaction is usually the following:
- We start with the Data-N folder of ~210MB
- After compaction we have a Data-(N+1) folder of size ~185MB, the old
Data-N being deleted.
- The sizes of the database directory and the Data-* directory are equal.

However when we check with df -h we sometimes see that volume usage is not
dropping, but on the contrary, it goes up ~140MB after each compaction.



Does stop/starting the server change those numbers?



Yes, then we start fresh where du -sh and df -h return the same numbers.



   Andy







Re: Requesting advice on Fuseki memory settings

2024-03-25 Thread Gaspar Bartalus
Dear Andy and co.,

Thanks for the support, I think we can close this thread for now.
We will continue to monitor this behaviour and if we can retrieve any
additional useful information then we might reopen it.

Best regards,
Gaspar

On Sun, Mar 24, 2024 at 5:00 PM Andy Seaborne  wrote:

>
>
> On 21/03/2024 09:52, Rob @ DNR wrote:
> > Gaspar
> >
> > This probably relates to https://access.redhat.com/solutions/2316
> >
> > Deleting a file removes it from the file table but doesn’t immediately
> free the space if a process is still accessing those files.  That could be
> something else inside the container, or in a containerised environment
> where the disk space is mounted that could potentially be host processes on
> the K8S node that are monitoring the storage.
>  >
> > There’s some suggested debugging steps in the RedHat article about ways
> to figure out what processes might still be holding onto the old database
> files
> >
> > Rob
>
> Fuseki does close the database connections after compact but only after
> all read transactions on the old database have completed. that can hold
> the database open for a while.
>
> Another delay is the ext4 filing system. Deletes will be in the journal
> and only when the journal operations are performed will the file system
> be released. Usually this happens quickly, but I've seen it take an
> appreciable length of time occasionally.
>
> Gaspar wrote:
>  > then we start fresh where du -sh and df -h return the same numbers.
>
> This indicates the file space has been release. Restarting clears any
> outstanding read-transactions and likely gives the ext4 journal to run
> through.
>
> Just about any layer (K8s, VMs) adds delays to real release of the space
> but it should happen eventually.
>
>  Andy
>
> > From: Gaspar Bartalus 
> > Date: Wednesday, 20 March 2024 at 11:41
> > To: users@jena.apache.org 
> > Subject: Re: Requesting advice on Fuseki memory settings
> > Hi Andy
> >
> > On Sat, Mar 16, 2024 at 8:58 PM Andy Seaborne  wrote:
> >
> >>
> >>
> >> On 12/03/2024 13:17, Gaspar Bartalus wrote:
> >>> On Mon, Mar 11, 2024 at 6:28 PM Andy Seaborne  wrote:
> >>>>
> >>>> On 11/03/2024 14:35, Gaspar Bartalus wrote:
> >>>>> Hi Andy,
> >>>>>
> >>>>> On Fri, Mar 8, 2024 at 4:41 PM Andy Seaborne
> wrote:
> >>>>>
> >>>>>>
> >>>>>> On 08/03/2024 10:40, Gaspar Bartalus wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Thanks for the responses.
> >>>>>>>
> >>>>>>> We were actually curious if you'd have some explanation for the
> >>>>>>> linear increase in the storage, and why we are seeing differences
> >>>> between
> >>>>>>> the actual size of our dataset and the size it uses on disk.
> (Changes
> >>>>>>> between `df -h` and `du -lh`)?
> >>>>>> Linear increase between compactions or across compactions? The
> latter
> >>>>>> sounds like the previous version hasn't been deleted.
> >>>>>>
> >>>>> Across compactions, increasing linearly over several days, with
> >>>> compactions
> >>>>> running every day. The compaction is used with the "deleteOld"
> >> parameter,
> >>>>> and there is only one Data- folder in the volume, so I assume
> >> compaction
> >>>>> itself works as expected.
> >>
> >>>> Strange - I can't explain that. Could you check that there is only one
> >>>> Data- directory inside the database directory?
> >>>>
> >>> Yes, there is surely just one Data- folder in the database
> directory.
> >>>
> >>>> What's the disk storage setup? e.g filesystem type.
> >>>>
> >>> We have an Azure disk of type Standard SSD LRS with a filesystem of
> type
> >>> Ext4.
> >>
> >> Hi Gaspar,
> >>
> >> I still can't explain what your seeing I'm afraid.
> >>
> >> Can we get some more details?
> >>
> >> When the server has Data-N -- how big (as reported by 'du -sh') is that
> >> directory and how big is the whole directory for the database. They
> >> should be nearly equal.
> >
> >
> >> When a compaction is done, and the server is at Data-(N+1), what are the
> >> sizes of Data-(N+1) and the database directory?
> >>
> >
> > What we see with respect to compaction is usually the following:
> > - We start with the Data-N folder of ~210MB
> > - After compaction we have a Data-(N+1) folder of size ~185MB, the old
> > Data-N being deleted.
> > - The sizes of the database directory and the Data-* directory are equal.
> >
> > However when we check with df -h we sometimes see that volume usage is
> not
> > dropping, but on the contrary, it goes up ~140MB after each compaction.
> >
> >>
> >> Does stop/starting the server change those numbers?
> >>
> >
> > Yes, then we start fresh where du -sh and df -h return the same numbers.
> >
> >>
> >>   Andy
> >>
>


Re: Requesting advice on Fuseki memory settings

2024-03-24 Thread Andy Seaborne




On 21/03/2024 09:52, Rob @ DNR wrote:

Gaspar

This probably relates to https://access.redhat.com/solutions/2316

Deleting a file removes it from the file table but doesn't immediately free the
space if a process is still accessing those files. That could be something
else inside the container or, in a containerised environment where the disk
space is mounted, it could potentially be host processes on the K8S node that
are monitoring the storage.


There are some suggested debugging steps in the RedHat article about ways to
figure out what processes might still be holding onto the old database files.

Rob


Fuseki does close the database connections after compact, but only after
all read transactions on the old database have completed. That can hold
the database open for a while.


Another delay is the ext4 file system. Deletes will be in the journal,
and only when the journal operations are performed is the space actually
released. Usually this happens quickly, but I've seen it take an
appreciable length of time occasionally.


Gaspar wrote:
> then we start fresh where du -sh and df -h return the same numbers.

This indicates the file space has been released. Restarting clears any
outstanding read transactions and likely gives the ext4 journal time to
run through.


Just about any layer (K8s, VMs) adds delay to the real release of the
space, but it should happen eventually.
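
A quick way to see whether the Fuseki process itself is still holding an
old Data-N generation open - a sketch, assuming a Linux container with a
single Fuseki JVM (the process name matched by pgrep is an assumption):

    # open file descriptors of the Fuseki JVM that point at Data-* files
    ls -l /proc/$(pgrep -f fuseki)/fd 2>/dev/null | grep 'Data-'

If that prints paths marked "(deleted)", an old generation is still held
open and the space only comes back once those transactions finish.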


Andy


From: Gaspar Bartalus 
Date: Wednesday, 20 March 2024 at 11:41
To: users@jena.apache.org 
Subject: Re: Requesting advice on Fuseki memory settings
Hi Andy

On Sat, Mar 16, 2024 at 8:58 PM Andy Seaborne  wrote:




On 12/03/2024 13:17, Gaspar Bartalus wrote:

On Mon, Mar 11, 2024 at 6:28 PM Andy Seaborne  wrote:


On 11/03/2024 14:35, Gaspar Bartalus wrote:

Hi Andy,

On Fri, Mar 8, 2024 at 4:41 PM Andy Seaborne  wrote:



On 08/03/2024 10:40, Gaspar Bartalus wrote:

Hi,

Thanks for the responses.

We were actually curious if you'd have some explanation for the
linear increase in the storage, and why we are seeing differences between
the actual size of our dataset and the size it uses on disk. (Changes
between `df -h` and `du -lh`)?

Linear increase between compactions or across compactions? The latter
sounds like the previous version hasn't been deleted.


Across compactions, increasing linearly over several days, with compactions
running every day. The compaction is used with the "deleteOld" parameter,
and there is only one Data- folder in the volume, so I assume compaction
itself works as expected.



Strange - I can't explain that. Could you check that there is only one
Data- directory inside the database directory?


Yes, there is surely just one Data- folder in the database directory.


What's the disk storage setup? e.g. filesystem type.


We have an Azure disk of type Standard SSD LRS with a filesystem of type
Ext4.


Hi Gaspar,

I still can't explain what you're seeing, I'm afraid.

Can we get some more details?

When the server has Data-N -- how big (as reported by 'du -sh') is that
directory and how big is the whole directory for the database? They
should be nearly equal.




When a compaction is done, and the server is at Data-(N+1), what are the
sizes of Data-(N+1) and the database directory?



What we see with respect to compaction is usually the following:
- We start with the Data-N folder of ~210MB
- After compaction we have a Data-(N+1) folder of size ~185MB, the old
Data-N being deleted.
- The sizes of the database directory and the Data-* directory are equal.

However when we check with df -h we sometimes see that volume usage is not
dropping, but on the contrary, it goes up ~140MB after each compaction.



Does stop/starting the server change those numbers?



Yes, then we start fresh where du -sh and df -h return the same numbers.



  Andy



Re: Requesting advice on Fuseki memory settings

2024-03-21 Thread Rob @ DNR
Gaspar

This probably relates to https://access.redhat.com/solutions/2316

Deleting a file removes it from the file table but doesn't immediately free the
space if a process is still accessing those files. That could be something
else inside the container or, in a containerised environment where the disk
space is mounted, it could potentially be host processes on the K8S node that
are monitoring the storage.

There are some suggested debugging steps in the RedHat article about ways to
figure out what processes might still be holding onto the old database files.
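
For a quick check along those lines - a sketch, assuming lsof is available
inside the container (the grep pattern just narrows the output to TDB2
Data-* generations):

    # open files whose directory entry has already been deleted
    lsof +L1 | grep 'Data-'
    # the same idea via /proc, if lsof is not installed
    ls -l /proc/*/fd 2>/dev/null | grep '(deleted)'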

Rob

From: Gaspar Bartalus 
Date: Wednesday, 20 March 2024 at 11:41
To: users@jena.apache.org 
Subject: Re: Requesting advice on Fuseki memory settings
Hi Andy

On Sat, Mar 16, 2024 at 8:58 PM Andy Seaborne  wrote:

>
>
> On 12/03/2024 13:17, Gaspar Bartalus wrote:
> > On Mon, Mar 11, 2024 at 6:28 PM Andy Seaborne  wrote:
> >>
> >> On 11/03/2024 14:35, Gaspar Bartalus wrote:
> >>> Hi Andy,
> >>>
> >>> On Fri, Mar 8, 2024 at 4:41 PM Andy Seaborne  wrote:
> >>>
> >>>>
> >>>> On 08/03/2024 10:40, Gaspar Bartalus wrote:
> >>>>> Hi,
> >>>>>
> >>>>> Thanks for the responses.
> >>>>>
> >>>>> We were actually curious if you'd have some explanation for the
> >>>>> linear increase in the storage, and why we are seeing differences
> >> between
> >>>>> the actual size of our dataset and the size it uses on disk. (Changes
> >>>>> between `df -h` and `du -lh`)?
> >>>> Linear increase between compactions or across compactions? The latter
> >>>> sounds like the previous version hasn't been deleted.
> >>>>
> >>> Across compactions, increasing linearly over several days, with
> >> compactions
> >>> running every day. The compaction is used with the "deleteOld"
> parameter,
> >>> and there is only one Data- folder in the volume, so I assume
> compaction
> >>> itself works as expected.
>
> >> Strange - I can't explain that. Could you check that there is only one
> >> Data- directory inside the database directory?
> >>
> > Yes, there is surely just one Data- folder in the database directory.
> >
> >> What's the disk storage setup? e.g filesystem type.
> >>
> > We have an Azure disk of type Standard SSD LRS with a filesystem of type
> > Ext4.
>
> Hi Gaspar,
>
> I still can't explain what your seeing I'm afraid.
>
> Can we get some more details?
>
> When the server has Data-N -- how big (as reported by 'du -sh') is that
> directory and how big is the whole directory for the database. They
> should be nearly equal.


> When a compaction is done, and the server is at Data-(N+1), what are the
> sizes of Data-(N+1) and the database directory?
>

What we see with respect to compaction is usually the following:
- We start with the Data-N folder of ~210MB
- After compaction we have a Data-(N+1) folder of size ~185MB, the old
Data-N being deleted.
- The sizes of the database directory and the Data-* directory are equal.

However when we check with df -h we sometimes see that volume usage is not
dropping, but on the contrary, it goes up ~140MB after each compaction.

>
> Does stop/starting the server change those numbers?
>

Yes, then we start fresh where du -sh and df -h return the same numbers.

>
>  Andy
>


Re: Requesting advice on Fuseki memory settings

2024-03-20 Thread Gaspar Bartalus
Hi Andy

On Sat, Mar 16, 2024 at 8:58 PM Andy Seaborne  wrote:

>
>
> On 12/03/2024 13:17, Gaspar Bartalus wrote:
> > On Mon, Mar 11, 2024 at 6:28 PM Andy Seaborne  wrote:
> >>
> >> On 11/03/2024 14:35, Gaspar Bartalus wrote:
> >>> Hi Andy,
> >>>
> >>> On Fri, Mar 8, 2024 at 4:41 PM Andy Seaborne  wrote:
> >>>
> 
>  On 08/03/2024 10:40, Gaspar Bartalus wrote:
> > Hi,
> >
> > Thanks for the responses.
> >
> > We were actually curious if you'd have some explanation for the
> > linear increase in the storage, and why we are seeing differences
> >> between
> > the actual size of our dataset and the size it uses on disk. (Changes
> > between `df -h` and `du -lh`)?
>  Linear increase between compactions or across compactions? The latter
>  sounds like the previous version hasn't been deleted.
> 
> >>> Across compactions, increasing linearly over several days, with
> >> compactions
> >>> running every day. The compaction is used with the "deleteOld"
> parameter,
> >>> and there is only one Data- folder in the volume, so I assume
> compaction
> >>> itself works as expected.
>
> >> Strange - I can't explain that. Could you check that there is only one
> >> Data- directory inside the database directory?
> >>
> > Yes, there is surely just one Data- folder in the database directory.
> >
> >> What's the disk storage setup? e.g filesystem type.
> >>
> > We have an Azure disk of type Standard SSD LRS with a filesystem of type
> > Ext4.
>
> Hi Gaspar,
>
> I still can't explain what your seeing I'm afraid.
>
> Can we get some more details?
>
> When the server has Data-N -- how big (as reported by 'du -sh') is that
> directory and how big is the whole directory for the database. They
> should be nearly equal.


> When a compaction is done, and the server is at Data-(N+1), what are the
> sizes of Data-(N+1) and the database directory?
>

What we see with respect to compaction is usually the following:
- We start with the Data-N folder of ~210MB
- After compaction we have a Data-(N+1) folder of size ~185MB, the old
Data-N being deleted.
- The sizes of the database directory and the Data-* directory are equal.

However when we check with df -h we sometimes see that volume usage is not
dropping, but on the contrary, it goes up ~140MB after each compaction.

>
> Does stop/starting the server change those numbers?
>

Yes, then we start fresh where du -sh and df -h return the same numbers.

>
>  Andy
>


Re: Requesting advice on Fuseki memory settings

2024-03-16 Thread Andy Seaborne




On 12/03/2024 13:17, Gaspar Bartalus wrote:

On Mon, Mar 11, 2024 at 6:28 PM Andy Seaborne  wrote:


On 11/03/2024 14:35, Gaspar Bartalus wrote:

Hi Andy,

On Fri, Mar 8, 2024 at 4:41 PM Andy Seaborne  wrote:



On 08/03/2024 10:40, Gaspar Bartalus wrote:

Hi,

Thanks for the responses.

We were actually curious if you'd have some explanation for the
linear increase in the storage, and why we are seeing differences between
the actual size of our dataset and the size it uses on disk. (Changes
between `df -h` and `du -lh`)?

Linear increase between compactions or across compactions? The latter
sounds like the previous version hasn't been deleted.


Across compactions, increasing linearly over several days, with compactions
running every day. The compaction is used with the "deleteOld" parameter,
and there is only one Data- folder in the volume, so I assume compaction
itself works as expected.



Strange - I can't explain that. Could you check that there is only one
Data- directory inside the database directory?


Yes, there is surely just one Data- folder in the database directory.


What's the disk storage setup? e.g. filesystem type.


We have an Azure disk of type Standard SSD LRS with a filesystem of type
Ext4.


Hi Gaspar,

I still can't explain what you're seeing, I'm afraid.

Can we get some more details?

When the server has Data-N -- how big (as reported by 'du -sh') is that
directory and how big is the whole directory for the database? They
should be nearly equal.


When a compaction is done, and the server is at Data-(N+1), what are the 
sizes of Data-(N+1) and the database directory?
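
For gathering those numbers, something like the following works - a sketch;
the database location /fuseki/databases/ds is an assumption:

    du -sh /fuseki/databases/ds          # whole database directory
    du -sh /fuseki/databases/ds/Data-*   # each Data-N generation
    df -h /fuseki/databases              # what the volume itself reports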


Does stop/starting the server change those numbers?

Andy


Re: Requesting advice on Fuseki memory settings

2024-03-12 Thread Gaspar Bartalus
On Mon, Mar 11, 2024 at 6:28 PM Andy Seaborne  wrote:

>
>
> On 11/03/2024 14:35, Gaspar Bartalus wrote:
> > Hi Andy,
> >
> > On Fri, Mar 8, 2024 at 4:41 PM Andy Seaborne  wrote:
> >
> >>
> >>
> >> On 08/03/2024 10:40, Gaspar Bartalus wrote:
> >>> Hi,
> >>>
> >>> Thanks for the responses.
> >>>
> >>> We were actually curious if you'd have some explanation for the
> >>> linear increase in the storage, and why we are seeing differences
> between
> >>> the actual size of our dataset and the size it uses on disk. (Changes
> >>> between `df -h` and `du -lh`)?
> >>
> >> Linear increase between compactions or across compactions? The latter
> >> sounds like the previous version hasn't been deleted.
> >>
> >
> > Across compactions, increasing linearly over several days, with
> compactions
> > running every day. The compaction is used with the "deleteOld" parameter,
> > and there is only one Data- folder in the volume, so I assume compaction
> > itself works as expected.
>
> Strange - I can't explain that. Could you check that there is only one
> Data- directory inside the database directory?
>

Yes, there is surely just one Data- folder in the database directory.

>
> What's the disk storage setup? e.g filesystem type.
>

We have an Azure disk of type Standard SSD LRS with a filesystem of type
Ext4.

>
>  Andy
>
> >> TDB uses sparse files. It allocates 8M chunks per index but that isn't
> >> used immediately. Sparse files are reported differently by different
> >> tools and also differently by different operating systems. I don't know
> >> how k3s is managing the storage.
> >>
> >> Sometimes it's the size of the file, sometimes it's the amount of space
> >> in use. For small databases, there is quite a difference.
> >>
> >> An empty database is around 220kbytes but you'll see many 8Mbyte files
> >> with "ls -l".
> >>
> >> If you zip the database up, and unpack it then it's 193Mbytes.
> >>
> >> After a compaction, the previous version of storage can be deleted. The
> >> directory "Data-..." - only the highest numbered directory is used. A
> >> previous one can be zipped up for backup.
> >>
> >>> The heap memory has some very minimal peaks, saw-tooth, but otherwise
> >> it's
> >>> flat.
> >>
> >> At what amount of memory?
> >>
> >
> > At ~7GB.
> >
> >>
> >>>
> >>> Regards,
> >>> Gaspar
> >>>
> >>> On Thu, Mar 7, 2024 at 11:55 PM Andy Seaborne  wrote:
> >>>
> 
> 
>  On 07/03/2024 13:24, Gaspar Bartalus wrote:
> > Dear Jena support team,
> >
> > We would like to ask you to help us in configuring the memory for our
> > jena-fuseki instance running in kubernetes.
> >
> > *We have the following setup:*
> >
> > * Jena-fuseki deployed as StatefulSet to a k8s cluster with the
> > resource config:
> >
> > Limits:
> > cpu: 2
> > memory:  16Gi
> > Requests:
> > cpu: 100m
> > memory:  11Gi
> >
> > * The JVM_ARGS has the following value: -Xmx10G
> >
> > * Our main dataset of type TDB2 contains ~1 million triples.
>  A million triples doesn't take up much RAM even in a memory dataset.
> 
>  In Java, the JVM will grow until it is close to the -Xmx figure. A
> major
>  GC will then free up a lot of memory. But the JVM does not give the
>  memory back to the kernel.
> 
>  TDB2 does not only use heap space. A heap of 2-4G is usually enough
> per
>  dataset, sometimes less (data shape depenendent - e.g. many large
>  literals used more space.
> 
>  Use a profiler to examine the heap in-use, you'll probably see a
>  saw-tooth shape.
>  Force a GC and see the level of in-use memory afterwards.
>  Add some safety margin and work space for requests and try that as the
>  heap size.
> 
> > *  We execute the following type of UPDATE operations:
> >  - There are triggers in the system (e.g. users of the
> application
> > changing the data) which start ~50 other update operations containing
> > up to ~30K triples. Most of them run in parallel, some are delayed
> > with seconds or minutes.
> >  - There are scheduled UPDATE operations (executed on hourly
> basis)
> > containing 30K-500K triples.
> >  - These UPDATE operations usually delete and insert the same
> amount
> > of triples in the dataset. We use the compact API as a nightly job.
> >
> > *We are noticing the following behaviour:*
> >
> > * Fuseki consumes 5-10G of heap memory continuously, as configured in
> > the JVM_ARGS.
> >
> > * There are points in time when the volume usage of the k8s container
> > starts to increase suddenly. This does not drop even though
> compaction
> > is successfully executed and the dataset size (triple count) does not
> > increase. See attachment below.
> >
> > *Our suspicions:*
> >
> > * garbage collection in Java is often delayed; memory is not freed as
> > 

Re: Requesting advice on Fuseki memory settings

2024-03-11 Thread Andy Seaborne




On 11/03/2024 14:35, Gaspar Bartalus wrote:

Hi Andy,

On Fri, Mar 8, 2024 at 4:41 PM Andy Seaborne  wrote:




On 08/03/2024 10:40, Gaspar Bartalus wrote:

Hi,

Thanks for the responses.

We were actually curious if you'd have some explanation for the
linear increase in the storage, and why we are seeing differences between
the actual size of our dataset and the size it uses on disk. (Changes
between `df -h` and `du -lh`)?


Linear increase between compactions or across compactions? The latter
sounds like the previous version hasn't been deleted.



Across compactions, increasing linearly over several days, with compactions
running every day. The compaction is used with the "deleteOld" parameter,
and there is only one Data- folder in the volume, so I assume compaction
itself works as expected.


Strange - I can't explain that. Could you check that there is only one 
Data- directory inside the database directory?


What's the disk storage setup? e.g. filesystem type.

Andy


TDB uses sparse files. It allocates 8M chunks per index but that isn't
used immediately. Sparse files are reported differently by different
tools and also differently by different operating systems. I don't know
how k3s is managing the storage.

Sometimes it's the size of the file, sometimes it's the amount of space
in use. For small databases, there is quite a difference.

An empty database is around 220kbytes but you'll see many 8Mbyte files
with "ls -l".

If you zip the database up, and unpack it then it's 193Mbytes.

After a compaction, the previous version of storage can be deleted. The
directory "Data-..." - only the highest numbered directory is used. A
previous one can be zipped up for backup.


The heap memory has some very minimal peaks, saw-tooth, but otherwise it's
flat.


At what amount of memory?



At ~7GB.





Regards,
Gaspar

On Thu, Mar 7, 2024 at 11:55 PM Andy Seaborne  wrote:




On 07/03/2024 13:24, Gaspar Bartalus wrote:

Dear Jena support team,

We would like to ask you to help us in configuring the memory for our
jena-fuseki instance running in kubernetes.

*We have the following setup:*

* Jena-fuseki deployed as StatefulSet to a k8s cluster with the
resource config:

Limits:
cpu: 2
memory:  16Gi
Requests:
cpu: 100m
memory:  11Gi

* The JVM_ARGS has the following value: -Xmx10G

* Our main dataset of type TDB2 contains ~1 million triples.

A million triples doesn't take up much RAM even in a memory dataset.

In Java, the JVM will grow until it is close to the -Xmx figure. A major
GC will then free up a lot of memory. But the JVM does not give the
memory back to the kernel.

TDB2 does not only use heap space. A heap of 2-4G is usually enough per
dataset, sometimes less (data shape dependent - e.g. many large
literals use more space).

Use a profiler to examine the heap in-use, you'll probably see a
saw-tooth shape.
Force a GC and see the level of in-use memory afterwards.
Add some safety margin and work space for requests and try that as the
heap size.


*  We execute the following type of UPDATE operations:
 - There are triggers in the system (e.g. users of the application
changing the data) which start ~50 other update operations containing
up to ~30K triples. Most of them run in parallel, some are delayed
with seconds or minutes.
 - There are scheduled UPDATE operations (executed on hourly basis)
containing 30K-500K triples.
 - These UPDATE operations usually delete and insert the same amount
of triples in the dataset. We use the compact API as a nightly job.

*We are noticing the following behaviour:*

* Fuseki consumes 5-10G of heap memory continuously, as configured in
the JVM_ARGS.

* There are points in time when the volume usage of the k8s container
starts to increase suddenly. This does not drop even though compaction
is successfully executed and the dataset size (triple count) does not
increase. See attachment below.

*Our suspicions:*

* garbage collection in Java is often delayed; memory is not freed as
quickly as we would expect it, and the heap limit is reached quickly
if multiple parallel queries are run
* long running database queries can send regular memory to Gen2, that
is not actively cleaned by the garbage collector
* memory-mapped files are also garbage-collected (and perhaps they
could go to Gen2 as well, using more and more storage space).

Could you please explain the possible reasons behind such a behaviour?
And finally could you please suggest a more appropriate configuration
for our use case?

Thanks in advance and best wishes,
Gaspar Bartalus











Re: Requesting advice on Fuseki memory settings

2024-03-11 Thread Marco Neumann
Hi Gaspar,

if you delete data from the graph you do not effectively remove data from
disk. TDB actually keeps the records on the file system.

Search the mailing list and you will find a more detailed response from
Andy.

If you want to make sure to keep the database size on disk to a minimum and
if it suits your use case you can physically remove the folder from disk
and reload the dataset.

Read "disk" here as any kind of storage device.

Best,
Marco


On Fri, Mar 8, 2024 at 10:40 AM Gaspar Bartalus  wrote:

> Hi,
>
> Thanks for the responses.
>
> We were actually curious if you'd have some explanation for the
> linear increase in the storage, and why we are seeing differences between
> the actual size of our dataset and the size it uses on disk. (Changes
> between `df -h` and `du -lh`)?
>
> The heap memory has some very minimal peaks, saw-tooth, but otherwise it's
> flat.
>
> Regards,
> Gaspar
>
> On Thu, Mar 7, 2024 at 11:55 PM Andy Seaborne  wrote:
>
> >
> >
> > On 07/03/2024 13:24, Gaspar Bartalus wrote:
> > > Dear Jena support team,
> > >
> > > We would like to ask you to help us in configuring the memory for our
> > > jena-fuseki instance running in kubernetes.
> > >
> > > *We have the following setup:*
> > >
> > > * Jena-fuseki deployed as StatefulSet to a k8s cluster with the
> > > resource config:
> > >
> > > Limits:
> > >   cpu: 2
> > >   memory:  16Gi
> > > Requests:
> > >   cpu: 100m
> > >   memory:  11Gi
> > >
> > > * The JVM_ARGS has the following value: -Xmx10G
> > >
> > > * Our main dataset of type TDB2 contains ~1 million triples.
> > A million triples doesn't take up much RAM even in a memory dataset.
> >
> > In Java, the JVM will grow until it is close to the -Xmx figure. A major
> > GC will then free up a lot of memory. But the JVM does not give the
> > memory back to the kernel.
> >
> > TDB2 does not only use heap space. A heap of 2-4G is usually enough per
> > dataset, sometimes less (data shape depenendent - e.g. many large
> > literals used more space.
> >
> > Use a profiler to examine the heap in-use, you'll probably see a
> > saw-tooth shape.
> > Force a GC and see the level of in-use memory afterwards.
> > Add some safety margin and work space for requests and try that as the
> > heap size.
> >
> > > *  We execute the following type of UPDATE operations:
> > >- There are triggers in the system (e.g. users of the application
> > > changing the data) which start ~50 other update operations containing
> > > up to ~30K triples. Most of them run in parallel, some are delayed
> > > with seconds or minutes.
> > >- There are scheduled UPDATE operations (executed on hourly basis)
> > > containing 30K-500K triples.
> > >- These UPDATE operations usually delete and insert the same amount
> > > of triples in the dataset. We use the compact API as a nightly job.
> > >
> > > *We are noticing the following behaviour:*
> > >
> > > * Fuseki consumes 5-10G of heap memory continuously, as configured in
> > > the JVM_ARGS.
> > >
> > > * There are points in time when the volume usage of the k8s container
> > > starts to increase suddenly. This does not drop even though compaction
> > > is successfully executed and the dataset size (triple count) does not
> > > increase. See attachment below.
> > >
> > > *Our suspicions:*
> > >
> > > * garbage collection in Java is often delayed; memory is not freed as
> > > quickly as we would expect it, and the heap limit is reached quickly
> > > if multiple parallel queries are run
> > > * long running database queries can send regular memory to Gen2, that
> > > is not actively cleaned by the garbage collector
> > > * memory-mapped files are also garbage-collected (and perhaps they
> > > could go to Gen2 as well, using more and more storage space).
> > >
> > > Could you please explain the possible reasons behind such a behaviour?
> > > And finally could you please suggest a more appropriate configuration
> > > for our use case?
> > >
> > > Thanks in advance and best wishes,
> > > Gaspar Bartalus
> > >
> >
>


-- 


---
Marco Neumann


Re: Requesting advice on Fuseki memory settings

2024-03-11 Thread Gaspar Bartalus
Hi Andy,

On Fri, Mar 8, 2024 at 4:41 PM Andy Seaborne  wrote:

>
>
> On 08/03/2024 10:40, Gaspar Bartalus wrote:
> > Hi,
> >
> > Thanks for the responses.
> >
> > We were actually curious if you'd have some explanation for the
> > linear increase in the storage, and why we are seeing differences between
> > the actual size of our dataset and the size it uses on disk. (Changes
> > between `df -h` and `du -lh`)?
>
> Linear increase between compactions or across compactions? The latter
> sounds like the previous version hasn't been deleted.
>

Across compactions, increasing linearly over several days, with compactions
running every day. The compaction is used with the "deleteOld" parameter,
and there is only one Data- folder in the volume, so I assume compaction
itself works as expected.
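
For reference, the nightly job calls the compact endpoint roughly like
this - a sketch; host, port and dataset name are assumptions:

    curl -XPOST 'http://localhost:3030/$/compact/ds?deleteOld=true'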

>
> TDB uses sparse files. It allocates 8M chunks per index but that isn't
> used immediately. Sparse files are reported differently by different
> tools and also differently by different operating systems. I don't know
> how k3s is managing the storage.
>
> Sometimes it's the size of the file, sometimes it's the amount of space
> in use. For small databases, there is quite a difference.
>
> An empty database is around 220kbytes but you'll see many 8Mbyte files
> with "ls -l".
>
> If you zip the database up, and unpack it then it's 193Mbytes.
>
> After a compaction, the previous version of storage can be deleted. The
> directory "Data-..." - only the highest numbered directory is used. A
> previous one can be zipped up for backup.
>
> > The heap memory has some very minimal peaks, saw-tooth, but otherwise
> it's
> > flat.
>
> At what amount of memory?
>

At ~7GB.

>
> >
> > Regards,
> > Gaspar
> >
> > On Thu, Mar 7, 2024 at 11:55 PM Andy Seaborne  wrote:
> >
> >>
> >>
> >> On 07/03/2024 13:24, Gaspar Bartalus wrote:
> >>> Dear Jena support team,
> >>>
> >>> We would like to ask you to help us in configuring the memory for our
> >>> jena-fuseki instance running in kubernetes.
> >>>
> >>> *We have the following setup:*
> >>>
> >>> * Jena-fuseki deployed as StatefulSet to a k8s cluster with the
> >>> resource config:
> >>>
> >>> Limits:
> >>>cpu: 2
> >>>memory:  16Gi
> >>> Requests:
> >>>cpu: 100m
> >>>memory:  11Gi
> >>>
> >>> * The JVM_ARGS has the following value: -Xmx10G
> >>>
> >>> * Our main dataset of type TDB2 contains ~1 million triples.
> >> A million triples doesn't take up much RAM even in a memory dataset.
> >>
> >> In Java, the JVM will grow until it is close to the -Xmx figure. A major
> >> GC will then free up a lot of memory. But the JVM does not give the
> >> memory back to the kernel.
> >>
> >> TDB2 does not only use heap space. A heap of 2-4G is usually enough per
> >> dataset, sometimes less (data shape depenendent - e.g. many large
> >> literals used more space.
> >>
> >> Use a profiler to examine the heap in-use, you'll probably see a
> >> saw-tooth shape.
> >> Force a GC and see the level of in-use memory afterwards.
> >> Add some safety margin and work space for requests and try that as the
> >> heap size.
> >>
> >>> *  We execute the following type of UPDATE operations:
> >>> - There are triggers in the system (e.g. users of the application
> >>> changing the data) which start ~50 other update operations containing
> >>> up to ~30K triples. Most of them run in parallel, some are delayed
> >>> with seconds or minutes.
> >>> - There are scheduled UPDATE operations (executed on hourly basis)
> >>> containing 30K-500K triples.
> >>> - These UPDATE operations usually delete and insert the same amount
> >>> of triples in the dataset. We use the compact API as a nightly job.
> >>>
> >>> *We are noticing the following behaviour:*
> >>>
> >>> * Fuseki consumes 5-10G of heap memory continuously, as configured in
> >>> the JVM_ARGS.
> >>>
> >>> * There are points in time when the volume usage of the k8s container
> >>> starts to increase suddenly. This does not drop even though compaction
> >>> is successfully executed and the dataset size (triple count) does not
> >>> increase. See attachment below.
> >>>
> >>> *Our suspicions:*
> >>>
> >>> * garbage collection in Java is often delayed; memory is not freed as
> >>> quickly as we would expect it, and the heap limit is reached quickly
> >>> if multiple parallel queries are run
> >>> * long running database queries can send regular memory to Gen2, that
> >>> is not actively cleaned by the garbage collector
> >>> * memory-mapped files are also garbage-collected (and perhaps they
> >>> could go to Gen2 as well, using more and more storage space).
> >>>
> >>> Could you please explain the possible reasons behind such a behaviour?
> >>> And finally could you please suggest a more appropriate configuration
> >>> for our use case?
> >>>
> >>> Thanks in advance and best wishes,
> >>> Gaspar Bartalus
> >>>
> >>
> >
>


Re: Requesting advice on Fuseki memory settings

2024-03-08 Thread Andy Seaborne

Hi Jan,

On 08/03/2024 12:31, Jan Eerdekens wrote:

In our data mesh use case we currently also have serious disk issues
because frequently removing/adding and updating data in a dataset seems to
increase the disk usage a lot. We're currently running frequent compact
calls, but especially on the larger datasets these have a tendency to
stall/not finish, which eventually causes the system to run out of storage
(even though the actual amount of data is relatively small).


Is there anything in the log files to indicate what is causing the 
compactions to fail?


Jena 5.0.0 will have a more robust compaction step for Linux and MacOS
(and native Windows eventually - but that is currently unreliable. Deleting
memory-mapped files on Windows is a well-known, long-standing JDK issue).



In the beginning we also had some memory/GC issues, but after assigning
some more memory (we're at 12Gb now), tuning some GC parameters, switching
to SSD and adding some CPU capacity the GC issues seem to be under control.
We're currently also looking into configuring the disk to have more IOPS to
see if that can help with the compacting issues we're seeing now.


What size is your data?

What sort of storage class are you using for the database?

Andy



On Fri, 8 Mar 2024 at 11:40, Gaspar Bartalus  wrote:


Hi,

Thanks for the responses.

We were actually curious if you'd have some explanation for the
linear increase in the storage, and why we are seeing differences between
the actual size of our dataset and the size it uses on disk. (Changes
between `df -h` and `du -lh`)?

The heap memory has some very minimal peaks, saw-tooth, but otherwise it's
flat.

Regards,
Gaspar

On Thu, Mar 7, 2024 at 11:55 PM Andy Seaborne  wrote:




On 07/03/2024 13:24, Gaspar Bartalus wrote:

Dear Jena support team,

We would like to ask you to help us in configuring the memory for our
jena-fuseki instance running in kubernetes.

*We have the following setup:*

* Jena-fuseki deployed as StatefulSet to a k8s cluster with the
resource config:

Limits:
   cpu: 2
   memory:  16Gi
Requests:
   cpu: 100m
   memory:  11Gi

* The JVM_ARGS has the following value: -Xmx10G

* Our main dataset of type TDB2 contains ~1 million triples.

A million triples doesn't take up much RAM even in a memory dataset.

In Java, the JVM will grow until it is close to the -Xmx figure. A major
GC will then free up a lot of memory. But the JVM does not give the
memory back to the kernel.

TDB2 does not only use heap space. A heap of 2-4G is usually enough per
dataset, sometimes less (data shape dependent - e.g. many large
literals use more space).

Use a profiler to examine the heap in-use, you'll probably see a
saw-tooth shape.
Force a GC and see the level of in-use memory afterwards.
Add some safety margin and work space for requests and try that as the
heap size.


*  We execute the following type of UPDATE operations:
- There are triggers in the system (e.g. users of the application
changing the data) which start ~50 other update operations containing
up to ~30K triples. Most of them run in parallel, some are delayed
with seconds or minutes.
- There are scheduled UPDATE operations (executed on hourly basis)
containing 30K-500K triples.
- These UPDATE operations usually delete and insert the same amount
of triples in the dataset. We use the compact API as a nightly job.

*We are noticing the following behaviour:*

* Fuseki consumes 5-10G of heap memory continuously, as configured in
the JVM_ARGS.

* There are points in time when the volume usage of the k8s container
starts to increase suddenly. This does not drop even though compaction
is successfully executed and the dataset size (triple count) does not
increase. See attachment below.

*Our suspicions:*

* garbage collection in Java is often delayed; memory is not freed as
quickly as we would expect it, and the heap limit is reached quickly
if multiple parallel queries are run
* long running database queries can send regular memory to Gen2, that
is not actively cleaned by the garbage collector
* memory-mapped files are also garbage-collected (and perhaps they
could go to Gen2 as well, using more and more storage space).

Could you please explain the possible reasons behind such a behaviour?
And finally could you please suggest a more appropriate configuration
for our use case?

Thanks in advance and best wishes,
Gaspar Bartalus









Re: Requesting advice on Fuseki memory settings

2024-03-08 Thread Andy Seaborne




On 08/03/2024 10:40, Gaspar Bartalus wrote:

Hi,

Thanks for the responses.

We were actually curious if you'd have some explanation for the
linear increase in the storage, and why we are seeing differences between
the actual size of our dataset and the size it uses on disk. (Changes
between `df -h` and `du -lh`)?


Linear increase between compactions or across compactions? The latter 
sounds like the previous version hasn't been deleted.


TDB uses sparse files. It allocates 8M chunks per index but that isn't 
used immediately. Sparse files are reported differently by different 
tools and also differently by different operating systems. I don't know 
how k3s is managing the storage.


Sometimes it's the size of the file, sometimes it's the amount of space 
in use. For small databases, there is quite a difference.


An empty database is around 220kbytes but you'll see many 8Mbyte files 
with "ls -l".


If you zip the database up, and unpack it then it's 193Mbytes.

After a compaction, the previous version of storage can be deleted. The 
directory "Data-..." - only the highest numbered directory is used. A 
previous one can be zipped up for backup.



The heap memory has some very minimal peaks, saw-tooth, but otherwise it's
flat.


At what amount of memory?



Regards,
Gaspar

On Thu, Mar 7, 2024 at 11:55 PM Andy Seaborne  wrote:




On 07/03/2024 13:24, Gaspar Bartalus wrote:

Dear Jena support team,

We would like to ask you to help us in configuring the memory for our
jena-fuseki instance running in kubernetes.

*We have the following setup:*

* Jena-fuseki deployed as StatefulSet to a k8s cluster with the
resource config:

Limits:
   cpu: 2
   memory:  16Gi
Requests:
   cpu: 100m
   memory:  11Gi

* The JVM_ARGS has the following value: -Xmx10G

* Our main dataset of type TDB2 contains ~1 million triples.

A million triples doesn't take up much RAM even in a memory dataset.

In Java, the JVM will grow until it is close to the -Xmx figure. A major
GC will then free up a lot of memory. But the JVM does not give the
memory back to the kernel.

TDB2 does not only use heap space. A heap of 2-4G is usually enough per
dataset, sometimes less (data shape dependent - e.g. many large
literals use more space).

Use a profiler to examine the heap in-use, you'll probably see a
saw-tooth shape.
Force a GC and see the level of in-use memory afterwards.
Add some safety margin and work space for requests and try that as the
heap size.


*  We execute the following type of UPDATE operations:
- There are triggers in the system (e.g. users of the application
changing the data) which start ~50 other update operations containing
up to ~30K triples. Most of them run in parallel, some are delayed
with seconds or minutes.
- There are scheduled UPDATE operations (executed on hourly basis)
containing 30K-500K triples.
- These UPDATE operations usually delete and insert the same amount
of triples in the dataset. We use the compact API as a nightly job.

*We are noticing the following behaviour:*

* Fuseki consumes 5-10G of heap memory continuously, as configured in
the JVM_ARGS.

* There are points in time when the volume usage of the k8s container
starts to increase suddenly. This does not drop even though compaction
is successfully executed and the dataset size (triple count) does not
increase. See attachment below.

*Our suspicions:*

* garbage collection in Java is often delayed; memory is not freed as
quickly as we would expect it, and the heap limit is reached quickly
if multiple parallel queries are run
* long running database queries can send regular memory to Gen2, that
is not actively cleaned by the garbage collector
* memory-mapped files are also garbage-collected (and perhaps they
could go to Gen2 as well, using more and more storage space).

Could you please explain the possible reasons behind such a behaviour?
And finally could you please suggest a more appropriate configuration
for our use case?

Thanks in advance and best wishes,
Gaspar Bartalus







Re: Requesting advice on Fuseki memory settings

2024-03-08 Thread Jan Eerdekens
In our data mesh use case we currently also have serious disk issues
because frequently removing/adding and updating data in a dataset seems to
increase the disk usage a lot. We're currently running frequent compact
calls, but especially on the larger datasets these have a tendency to
stall/not finish, which eventually causes the system to run out of storage
(even though the actual amount of data is relatively small).

In the beginning we also had some memory/GC issues, but after assigning
some more memory (we're at 12Gb now), tuning some GC parameters, switching
to SSD and adding some CPU capacity the GC issues seem to be under control.
We're currently also looking into configuring the disk to have more IOPS to
see if that can help with the compacting issues we're seeing now.

On Fri, 8 Mar 2024 at 11:40, Gaspar Bartalus  wrote:

> Hi,
>
> Thanks for the responses.
>
> We were actually curious if you'd have some explanation for the
> linear increase in the storage, and why we are seeing differences between
> the actual size of our dataset and the size it uses on disk. (Changes
> between `df -h` and `du -lh`)?
>
> The heap memory has some very minimal peaks, saw-tooth, but otherwise it's
> flat.
>
> Regards,
> Gaspar
>
> On Thu, Mar 7, 2024 at 11:55 PM Andy Seaborne  wrote:
>
> >
> >
> > On 07/03/2024 13:24, Gaspar Bartalus wrote:
> > > Dear Jena support team,
> > >
> > > We would like to ask you to help us in configuring the memory for our
> > > jena-fuseki instance running in kubernetes.
> > >
> > > *We have the following setup:*
> > >
> > > * Jena-fuseki deployed as StatefulSet to a k8s cluster with the
> > > resource config:
> > >
> > > Limits:
> > >   cpu: 2
> > >   memory:  16Gi
> > > Requests:
> > >   cpu: 100m
> > >   memory:  11Gi
> > >
> > > * The JVM_ARGS has the following value: -Xmx10G
> > >
> > > * Our main dataset of type TDB2 contains ~1 million triples.
> > A million triples doesn't take up much RAM even in a memory dataset.
> >
> > In Java, the JVM will grow until it is close to the -Xmx figure. A major
> > GC will then free up a lot of memory. But the JVM does not give the
> > memory back to the kernel.
> >
> > TDB2 does not only use heap space. A heap of 2-4G is usually enough per
> > dataset, sometimes less (data shape depenendent - e.g. many large
> > literals used more space.
> >
> > Use a profiler to examine the heap in-use, you'll probably see a
> > saw-tooth shape.
> > Force a GC and see the level of in-use memory afterwards.
> > Add some safety margin and work space for requests and try that as the
> > heap size.
> >
> > > *  We execute the following type of UPDATE operations:
> > >- There are triggers in the system (e.g. users of the application
> > > changing the data) which start ~50 other update operations containing
> > > up to ~30K triples. Most of them run in parallel, some are delayed
> > > with seconds or minutes.
> > >- There are scheduled UPDATE operations (executed on hourly basis)
> > > containing 30K-500K triples.
> > >- These UPDATE operations usually delete and insert the same amount
> > > of triples in the dataset. We use the compact API as a nightly job.
> > >
> > > *We are noticing the following behaviour:*
> > >
> > > * Fuseki consumes 5-10G of heap memory continuously, as configured in
> > > the JVM_ARGS.
> > >
> > > * There are points in time when the volume usage of the k8s container
> > > starts to increase suddenly. This does not drop even though compaction
> > > is successfully executed and the dataset size (triple count) does not
> > > increase. See attachment below.
> > >
> > > *Our suspicions:*
> > >
> > > * garbage collection in Java is often delayed; memory is not freed as
> > > quickly as we would expect it, and the heap limit is reached quickly
> > > if multiple parallel queries are run
> > > * long running database queries can send regular memory to Gen2, that
> > > is not actively cleaned by the garbage collector
> > > * memory-mapped files are also garbage-collected (and perhaps they
> > > could go to Gen2 as well, using more and more storage space).
> > >
> > > Could you please explain the possible reasons behind such a behaviour?
> > > And finally could you please suggest a more appropriate configuration
> > > for our use case?
> > >
> > > Thanks in advance and best wishes,
> > > Gaspar Bartalus
> > >
> >
>


Re: Requesting advice on Fuseki memory settings

2024-03-08 Thread Gaspar Bartalus
Hi,

Thanks for the responses.

We were actually curious if you'd have some explanation for the
linear increase in the storage, and why we are seeing differences between
the actual size of our dataset and the size it uses on disk. (Changes
between `df -h` and `du -lh`)?

The heap memory has some very minimal peaks, saw-tooth, but otherwise it's
flat.

Regards,
Gaspar

On Thu, Mar 7, 2024 at 11:55 PM Andy Seaborne  wrote:

>
>
> On 07/03/2024 13:24, Gaspar Bartalus wrote:
> > Dear Jena support team,
> >
> > We would like to ask you to help us in configuring the memory for our
> > jena-fuseki instance running in kubernetes.
> >
> > *We have the following setup:*
> >
> > * Jena-fuseki deployed as StatefulSet to a k8s cluster with the
> > resource config:
> >
> > Limits:
> >   cpu: 2
> >   memory:  16Gi
> > Requests:
> >   cpu: 100m
> >   memory:  11Gi
> >
> > * The JVM_ARGS has the following value: -Xmx10G
> >
> > * Our main dataset of type TDB2 contains ~1 million triples.
> A million triples doesn't take up much RAM even in a memory dataset.
>
> In Java, the JVM will grow until it is close to the -Xmx figure. A major
> GC will then free up a lot of memory. But the JVM does not give the
> memory back to the kernel.
>
> TDB2 does not only use heap space. A heap of 2-4G is usually enough per
> dataset, sometimes less (data shape depenendent - e.g. many large
> literals used more space.
>
> Use a profiler to examine the heap in-use, you'll probably see a
> saw-tooth shape.
> Force a GC and see the level of in-use memory afterwards.
> Add some safety margin and work space for requests and try that as the
> heap size.
>
> > *  We execute the following type of UPDATE operations:
> >- There are triggers in the system (e.g. users of the application
> > changing the data) which start ~50 other update operations containing
> > up to ~30K triples. Most of them run in parallel, some are delayed
> > with seconds or minutes.
> >- There are scheduled UPDATE operations (executed on hourly basis)
> > containing 30K-500K triples.
> >- These UPDATE operations usually delete and insert the same amount
> > of triples in the dataset. We use the compact API as a nightly job.
> >
> > *We are noticing the following behaviour:*
> >
> > * Fuseki consumes 5-10G of heap memory continuously, as configured in
> > the JVM_ARGS.
> >
> > * There are points in time when the volume usage of the k8s container
> > starts to increase suddenly. This does not drop even though compaction
> > is successfully executed and the dataset size (triple count) does not
> > increase. See attachment below.
> >
> > *Our suspicions:*
> >
> > * garbage collection in Java is often delayed; memory is not freed as
> > quickly as we would expect it, and the heap limit is reached quickly
> > if multiple parallel queries are run
> > * long running database queries can send regular memory to Gen2, that
> > is not actively cleaned by the garbage collector
> > * memory-mapped files are also garbage-collected (and perhaps they
> > could go to Gen2 as well, using more and more storage space).
> >
> > Could you please explain the possible reasons behind such a behaviour?
> > And finally could you please suggest a more appropriate configuration
> > for our use case?
> >
> > Thanks in advance and best wishes,
> > Gaspar Bartalus
> >
>


Re: Requesting advice on Fuseki memory settings

2024-03-07 Thread Martynas Jusevičius
If it helps, I have a setup I have used to profile Fuseki in VisualVM:
https://github.com/AtomGraph/fuseki-docker



On Thu, 7 Mar 2024 at 22.55, Andy Seaborne  wrote:

>
>
> On 07/03/2024 13:24, Gaspar Bartalus wrote:
> > Dear Jena support team,
> >
> > We would like to ask you to help us in configuring the memory for our
> > jena-fuseki instance running in kubernetes.
> >
> > *We have the following setup:*
> >
> > * Jena-fuseki deployed as StatefulSet to a k8s cluster with the
> > resource config:
> >
> > Limits:
> >   cpu: 2
> >   memory:  16Gi
> > Requests:
> >   cpu: 100m
> >   memory:  11Gi
> >
> > * The JVM_ARGS has the following value: -Xmx10G
> >
> > * Our main dataset of type TDB2 contains ~1 million triples.
> A million triples doesn't take up much RAM even in a memory dataset.
>
> In Java, the JVM will grow until it is close to the -Xmx figure. A major
> GC will then free up a lot of memory. But the JVM does not give the
> memory back to the kernel.
>
> TDB2 does not only use heap space. A heap of 2-4G is usually enough per
> dataset, sometimes less (data shape depenendent - e.g. many large
> literals used more space.
>
> Use a profiler to examine the heap in-use, you'll probably see a
> saw-tooth shape.
> Force a GC and see the level of in-use memory afterwards.
> Add some safety margin and work space for requests and try that as the
> heap size.
>
> > *  We execute the following type of UPDATE operations:
> >- There are triggers in the system (e.g. users of the application
> > changing the data) which start ~50 other update operations containing
> > up to ~30K triples. Most of them run in parallel, some are delayed
> > with seconds or minutes.
> >- There are scheduled UPDATE operations (executed on hourly basis)
> > containing 30K-500K triples.
> >- These UPDATE operations usually delete and insert the same amount
> > of triples in the dataset. We use the compact API as a nightly job.
> >
> > *We are noticing the following behaviour:*
> >
> > * Fuseki consumes 5-10G of heap memory continuously, as configured in
> > the JVM_ARGS.
> >
> > * There are points in time when the volume usage of the k8s container
> > starts to increase suddenly. This does not drop even though compaction
> > is successfully executed and the dataset size (triple count) does not
> > increase. See attachment below.
> >
> > *Our suspicions:*
> >
> > * garbage collection in Java is often delayed; memory is not freed as
> > quickly as we would expect it, and the heap limit is reached quickly
> > if multiple parallel queries are run
> > * long running database queries can send regular memory to Gen2, that
> > is not actively cleaned by the garbage collector
> > * memory-mapped files are also garbage-collected (and perhaps they
> > could go to Gen2 as well, using more and more storage space).
> >
> > Could you please explain the possible reasons behind such a behaviour?
> > And finally could you please suggest a more appropriate configuration
> > for our use case?
> >
> > Thanks in advance and best wishes,
> > Gaspar Bartalus
> >
>


Re: Requesting advice on Fuseki memory settings

2024-03-07 Thread Andy Seaborne



On 07/03/2024 13:24, Gaspar Bartalus wrote:

Dear Jena support team,

We would like to ask you to help us in configuring the memory for our 
jena-fuseki instance running in kubernetes.


*We have the following setup:*

* Jena-fuseki deployed as StatefulSet to a k8s cluster with the 
resource config:


Limits:
  cpu:     2
  memory:  16Gi
Requests:
  cpu:     100m
  memory:  11Gi

* The JVM_ARGS has the following value: -Xmx10G

* Our main dataset of type TDB2 contains ~1 million triples.

A million triples doesn't take up much RAM even in a memory dataset.

In Java, the JVM will grow until it is close to the -Xmx figure. A major 
GC will then free up a lot of memory. But the JVM does not give the 
memory back to the kernel.


TDB2 does not only use heap space. A heap of 2-4G is usually enough per
dataset, sometimes less (data shape dependent - e.g. many large
literals use more space).


Use a profiler to examine the heap in-use, you'll probably see a 
saw-tooth shape.

Force a GC and see the level of in-use memory afterwards.
Add some safety margin and work space for requests and try that as the 
heap size.
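
A minimal way to do that from inside the container - a sketch, assuming a
JDK that ships jcmd and a single Fuseki JVM (the process name matched by
pgrep is an assumption):

    PID=$(pgrep -f fuseki)     # the Fuseki JVM
    jcmd $PID GC.run           # force a full GC
    jcmd $PID GC.heap_info     # heap actually in use afterwards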



*  We execute the following type of UPDATE operations:
   - There are triggers in the system (e.g. users of the application 
changing the data) which start ~50 other update operations containing 
up to ~30K triples. Most of them run in parallel, some are delayed 
with seconds or minutes.
   - There are scheduled UPDATE operations (executed on hourly basis) 
containing 30K-500K triples.
   - These UPDATE operations usually delete and insert the same amount 
of triples in the dataset. We use the compact API as a nightly job.


*We are noticing the following behaviour:*

* Fuseki consumes 5-10G of heap memory continuously, as configured in 
the JVM_ARGS.


* There are points in time when the volume usage of the k8s container 
starts to increase suddenly. This does not drop even though compaction 
is successfully executed and the dataset size (triple count) does not 
increase. See attachment below.


*Our suspicions:*

* garbage collection in Java is often delayed; memory is not freed as 
quickly as we would expect it, and the heap limit is reached quickly 
if multiple parallel queries are run
* long running database queries can send regular memory to Gen2, that 
is not actively cleaned by the garbage collector
* memory-mapped files are also garbage-collected (and perhaps they 
could go to Gen2 as well, using more and more storage space).


Could you please explain the possible reasons behind such a behaviour?
And finally could you please suggest a more appropriate configuration 
for our use case?


Thanks in advance and best wishes,
Gaspar Bartalus



Requesting advice on Fuseki memory settings

2024-03-07 Thread Gaspar Bartalus
Dear Jena support team,

We would like to ask you to help us in configuring the memory for our
jena-fuseki instance running in kubernetes.

*We have the following setup:*

* Jena-fuseki deployed as StatefulSet to a k8s cluster with the resource
config:

Limits:
  cpu:     2
  memory:  16Gi
Requests:
  cpu:     100m
  memory:  11Gi

* The JVM_ARGS has the following value: -Xmx10G

* Our main dataset of type TDB2 contains ~1 million triples.

*  We execute the following type of UPDATE operations:
   - There are triggers in the system (e.g. users of the application
changing the data) which start ~50 other update operations containing up to
~30K triples. Most of them run in parallel, some are delayed with seconds
or minutes.
   - There are scheduled UPDATE operations (executed on hourly basis)
containing 30K-500K triples.
   - These UPDATE operations usually delete and insert the same amount of
triples in the dataset. We use the compact API as a nightly job.

*We are noticing the following behaviour:*

* Fuseki consumes 5-10G of heap memory continuously, as configured in the
JVM_ARGS.

* There are points in time when the volume usage of the k8s container
starts to increase suddenly. This does not drop even though compaction is
successfully executed and the dataset size (triple count) does not
increase. See attachment below.

*Our suspicions:*

* garbage collection in Java is often delayed; memory is not freed as
quickly as we would expect it, and the heap limit is reached quickly if
multiple parallel queries are run
* long running database queries can send regular memory to Gen2, that is
not actively cleaned by the garbage collector
* memory-mapped files are also garbage-collected (and perhaps they could go
to Gen2 as well, using more and more storage space).

Could you please explain the possible reasons behind such a behaviour?
And finally could you please suggest a more appropriate configuration for
our use case?

Thanks in advance and best wishes,
Gaspar Bartalus