Re: Mystery memory leak in fuseki

2023-07-04 Thread Dave Reynolds

Try that again:

For interest this is what the JVM metrics look like. The main 
heap/non-heap ones are:


https://www.dropbox.com/s/g1ih98kprnvjvxx/fusdeki-metrics-1.png?dl=0

So stable at around 75MB used, 110MB committed.

Whereas the buffer pools are:

https://www.dropbox.com/s/c77b2oarzxjlsa7/fuseki-buffer-metrics.png?dl=0

So they get up to a size comparable with the allowed max heap size (500MB) 
and are then GC'ed back. That's a lot of churn just for reporting the 
metrics, but there's no sign of the upward trend which dominates the MEM% 
curves and nothing to explain the growth to 1.8GB and beyond.


I guess we could try doing a heap dump anyway in case that gives a clue, 
but I'm not sure that's the right haystack.
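
For the record, taking the dump itself is easy enough - jcmd/jmap against 
the fuseki pid, or programmatically via the HotSpot diagnostic bean. A 
minimal sketch of the latter (it dumps whichever JVM it runs in, so it 
would need to run in-process; the output path and live-objects flag are 
just illustrative):

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDump {
    public static void main(String[] args) throws Exception {
        // Grab the HotSpot diagnostic bean for the current JVM.
        HotSpotDiagnosticMXBean diag =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // Write an hprof snapshot of live objects only (path is illustrative).
        diag.dumpHeap("/tmp/fuseki-heap.hprof", true);
    }
}

Of course that only captures the Java heap, which is exactly why it may be 
the wrong haystack for a native leak.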


Dave

On 04/07/2023 10:56, Dave Reynolds wrote:
For interest this is what the JVM metrics look like. The main 
heap/non-heap ones are:


https://www.dropbox.com/s/8auux5v352ur04m/fusdeki-metrics-1.png?dl=0

So stable at around 75MB used, 110MB committed.

Whereas the buffer pools are:

https://www.dropbox.com/s/c77b2oarzxjlsa7/fuseki-buffer-metrics.png?dl=0

So they get up to a size comparable with the allowed max heap size (500MB) 
and are then GC'ed back. That's a lot of churn just for reporting the 
metrics, but there's no sign of the upward trend which dominates the MEM% 
curves and nothing to explain the growth to 1.8GB and beyond.


I guess we could try doing a heap dump anyway in case that gives a clue, 
but I'm not sure that's the right haystack.


Dave


On 04/07/2023 10:41, Dave Reynolds wrote:
 >  Does this only happen in a container?  Or can you reproduce it 
running locally as well?


Not reproduced locally yet, partly because it's harder to set up the 
equivalent metrics monitoring there.


Can try harder at that.

 > If you can reproduce it locally then attaching a profiler like 
VisualVM so you can take a heap snapshot and see where the memory is 
going that would be useful


Thanks, I'm aware of that option, but I thought that would just allow us to 
probe the heap, non-heap and buffer JVM memory pools. We have quite 
detailed monitoring traces on all the JVM metrics, which confirm that heap 
and non-heap are fine, sitting stably at a low level and not reflecting 
the leak.


That's also what tells us the direct memory buffers are cycling but 
being properly collected and not leaking. Assuming the JVM metrics are 
accurate then the leak is somewhere in native memory beyond the ken of 
the JVM metrics.


Dave


On 04/07/2023 10:11, Rob @ DNR wrote:
Does this only happen in a container?  Or can you reproduce it 
running locally as well?


If you can reproduce it locally then attaching a profiler like 
VisualVM so you can take a heap snapshot and see where the memory is 
going that would be useful


Rob

From: Dave Reynolds 
Date: Tuesday, 4 July 2023 at 09:31
To: users@jena.apache.org 
Subject: Re: Mystery memory leak in fuseki
Tried 4.7.0 under the most up-to-date java 17 and it acts like 4.8.0. After
16 hours it gets to about 1.6GB and by eye has nearly flattened off
somewhat but not completely.

For interest here's a MEM% curve on a 4GB box (hope the link works).

https://www.dropbox.com/s/xjmluk4o3wlwo0y/fuseki-mem-percent.png?dl=0

The flattish curve from 12:00 to 17:20 is a run using 3.16.0 for
comparison. The curve from then onwards is 4.7.0.

The spikes on the 4.7.0 match the allocation and recovery of the direct
memory buffers. The JVM metrics show those cycling around every 10mins
and being reclaimed each time with no leaking visible at that level.
Heap, non-heap and mapped buffers are all basically unchanging which is
to be expected since it's doing nothing apart from reporting metrics.

Whereas this curve (again from 17:20 onwards) shows basically the same
4.7.0 set up on a separate host, showing that despite flattening out
somewhat usage continues to grow - at least on a 16 hour timescale.

https://www.dropbox.com/s/k0v54yq4kexklk0/fuseki-mem-percent-2.png?dl=0


Both of those runs were using Eclipse Temurin on a base Ubuntu jammy
container. Previous runs used AWS Corretto on an AL2 base container. 
Behaviour was basically unchanged, which eliminates this being some 
Corretto-specific issue or a weird base container OS issue.

Dave

On 03/07/2023 14:54, Andy Seaborne wrote:

Hi Dave,

Could you try 4.7.0?

4.6.0 was 2022-08-20
4.7.0 was 2022-12-27
4.8.0 was 2023-04-20

This is an in-memory database?

Micrometer/Prometheus has had several upgrades but if it is not heap and
not direct memory (I thought that was a hard bound set at start up), I
don't see how it can be involved.

  Andy

On 03/07/2023 14:20, Dave Reynolds wrote:

We have a very strange problem with recent fuseki versions when
running (in docker containers) on small machines. Suspect a jetty
issue but it's not clear.

Wondering if anyone has seen anything like this.

This is a production service but with tiny data (~250k triples, ~60MB
as NQuads). Runs on 4GB machines with java heap allocation of 
500MB[1].


We used to run using 3.16 on jdk 8 (AWS Corretto for the long term
support) with no problems. 

Re: Mystery memory leak in fuseki

2023-07-04 Thread Dave Reynolds
For interest this is what the JVM metrics look like. The main 
heap/non-heap ones are:


https://www.dropbox.com/s/8auux5v352ur04m/fusdeki-metrics-1.png?dl=0

So stable at around 75MB used, 110MB committed.

Whereas the buffer pools are:

https://www.dropbox.com/s/c77b2oarzxjlsa7/fuseki-buffer-metrics.png?dl=0

So they get up to a size comparable with the allowed max heap size (500MB) 
and are then GC'ed back. That's a lot of churn just for reporting the 
metrics, but there's no sign of the upward trend which dominates the MEM% 
curves and nothing to explain the growth to 1.8GB and beyond.


I guess we could try doing a heap dump anyway in case that gives a clue, 
but I'm not sure that's the right haystack.


Dave


On 04/07/2023 10:41, Dave Reynolds wrote:
 >  Does this only happen in a container?  Or can you reproduce it 
running locally as well?


Not reproduced locally yet, partly because it's harder to set up the 
equivalent metrics monitoring there.


Can try harder at that.

 > If you can reproduce it locally then attaching a profiler like 
VisualVM so you can take a heap snapshot and see where the memory is 
going that would be useful


Thanks, I'm aware of that option, but I thought that would just allow us to 
probe the heap, non-heap and buffer JVM memory pools. We have quite 
detailed monitoring traces on all the JVM metrics, which confirm that heap 
and non-heap are fine, sitting stably at a low level and not reflecting 
the leak.


That's also what tells us the direct memory buffers are cycling but 
being properly collected and not leaking. Assuming the JVM metrics are 
accurate then the leak is somewhere in native memory beyond the ken of 
the JVM metrics.


Dave


On 04/07/2023 10:11, Rob @ DNR wrote:
Does this only happen in a container?  Or can you reproduce it running 
locally as well?


If you can reproduce it locally then attaching a profiler like 
VisualVM so you can take a heap snapshot and see where the memory is 
going that would be useful


Rob

From: Dave Reynolds 
Date: Tuesday, 4 July 2023 at 09:31
To: users@jena.apache.org 
Subject: Re: Mystery memory leak in fuseki
Tried 4.7.0 under the most up-to-date java 17 and it acts like 4.8.0. After
16 hours it gets to about 1.6GB and by eye has nearly flattened off
somewhat but not completely.

For interest here's a MEM% curve on a 4GB box (hope the link works).

https://www.dropbox.com/s/xjmluk4o3wlwo0y/fuseki-mem-percent.png?dl=0

The flattish curve from 12:00 to 17:20 is a run using 3.16.0 for
comparison. The curve from then onwards is 4.7.0.

The spikes on the 4.7.0 match the allocation and recovery of the direct
memory buffers. The JVM metrics show those cycling around every 10mins
and being reclaimed each time with no leaking visible at that level.
Heap, non-heap and mapped buffers are all basically unchanging which is
to be expected since it's doing nothing apart from reporting metrics.

Whereas this curve (again from 17:20 onwards) shows basically the same
4.7.0 set up on a separate host, showing that despite flattening out
somewhat usage continues to grow - at least on a 16 hour timescale.

https://www.dropbox.com/s/k0v54yq4kexklk0/fuseki-mem-percent-2.png?dl=0


Both of those runs were using Eclipse Temurin on a base Ubuntu jammy
container. Previous runs used AWS Corretto on an AL2 base container.
Behaviour was basically unchanged, which eliminates this being some
Corretto-specific issue or a weird base container OS issue.

Dave

On 03/07/2023 14:54, Andy Seaborne wrote:

Hi Dave,

Could you try 4.7.0?

4.6.0 was 2022-08-20
4.7.0 was 2022-12-27
4.8.0 was 2023-04-20

This is an in-memory database?

Micrometer/Prometheus has had several upgrades but if it is not heap and
not direct memory (I thought that was a hard bound set at start up), I
don't see how it can be involved.

  Andy

On 03/07/2023 14:20, Dave Reynolds wrote:

We have a very strange problem with recent fuseki versions when
running (in docker containers) on small machines. Suspect a jetty
issue but it's not clear.

Wondering if anyone has seen anything like this.

This is a production service but with tiny data (~250k triples, ~60MB
as NQuads). Runs on 4GB machines with java heap allocation of 500MB[1].

We used to run using 3.16 on jdk 8 (AWS Corretto for the long term
support) with no problems.

Switching to fuseki 4.8.0 on jdk 11 the process grows in the space of
a day or so to reach ~3GB of memory at which point the 4GB machine
becomes unviable and things get OOM killed.

The strange thing is that this growth happens when the system is
answering no Sparql queries at all, just regular health ping checks
and (prometheus) metrics scrapes from the monitoring systems.

Furthermore the space being consumed is not visible to any of the JVM
metrics:
- Heap and non-heap are stable at around 100MB total (mostly
non-heap metaspace).
- Mapped buffers stay at 50MB and remain long term stable.
- Direct memory buffers being allocated up to around 500MB then being
reclaimed. Since there are no sparql queries at all we assume this is 
jetty NIO buffers being churned as a result of the metric scrapes. 

Re: Mystery memory leak in fuseki

2023-07-04 Thread Dave Reynolds
>  Does this only happen in a container?  Or can you reproduce it 
running locally as well?


Not reproduced locally yet, partly because it's harder to set up the 
equivalent metrics monitoring there.


Can try harder at that.

> If you can reproduce it locally then attaching a profiler like 
VisualVM so you can take a heap snapshot and see where the memory is 
going that would be useful


Thanks, I'm aware of that option, but I thought that would just allow us to 
probe the heap, non-heap and buffer JVM memory pools. We have quite 
detailed monitoring traces on all the JVM metrics, which confirm that heap 
and non-heap are fine, sitting stably at a low level and not reflecting 
the leak.


That's also what tells us the direct memory buffers are cycling but 
being properly collected and not leaking. Assuming the JVM metrics are 
accurate then the leak is somewhere in native memory beyond the ken of 
the JVM metrics.
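
For reference, those JVM metrics only cover what the platform management 
beans expose - roughly what a sketch like this would print if run in the 
Fuseki JVM (the names and formatting are just illustrative). Anything the 
process allocates outside these pools (malloc arenas, JNI allocations and 
so on) never shows up here:

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class JvmMemoryReport {
    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = mem.getHeapMemoryUsage();
        MemoryUsage nonHeap = mem.getNonHeapMemoryUsage();
        // Heap / non-heap: the ~75MB used, ~110MB committed we see in the dashboards.
        System.out.printf("heap     used=%dMB committed=%dMB%n",
                heap.getUsed() >> 20, heap.getCommitted() >> 20);
        System.out.printf("non-heap used=%dMB committed=%dMB%n",
                nonHeap.getUsed() >> 20, nonHeap.getCommitted() >> 20);
        // Buffer pools: "direct" is the one cycling up to ~500MB, "mapped" sits around 50MB.
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("buffers %-7s count=%d used=%dMB capacity=%dMB%n",
                    pool.getName(), pool.getCount(),
                    pool.getMemoryUsed() >> 20, pool.getTotalCapacity() >> 20);
        }
    }
}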


Dave


On 04/07/2023 10:11, Rob @ DNR wrote:

Does this only happen in a container?  Or can you reproduce it running locally 
as well?

If you can reproduce it locally then attaching a profiler like VisualVM so you 
can take a heap snapshot and see where the memory is going that would be useful

Rob

From: Dave Reynolds 
Date: Tuesday, 4 July 2023 at 09:31
To: users@jena.apache.org 
Subject: Re: Mystery memory leak in fuseki
Tried 4.7.0 under the most up-to-date java 17 and it acts like 4.8.0. After
16 hours it gets to about 1.6GB and by eye has nearly flattened off
somewhat but not completely.

For interest here's a MEM% curve on a 4GB box (hope the link works).

https://www.dropbox.com/s/xjmluk4o3wlwo0y/fuseki-mem-percent.png?dl=0

The flattish curve from 12:00 to 17:20 is a run using 3.16.0 for
comparison. The curve from then onwards is 4.7.0.

The spikes on the 4.7.0 match the allocation and recovery of the direct
memory buffers. The JVM metrics show those cycling around every 10mins
and being reclaimed each time with no leaking visible at that level.
Heap, non-heap and mapped buffers are all basically unchanging which is
to be expected since it's doing nothing apart from reporting metrics.

Whereas this curve (again from 17:20 onwards) shows basically the same
4.7.0 set up on a separate host, showing that despite flattening out
somewhat usage continues to grow - at least on a 16 hour timescale.

https://www.dropbox.com/s/k0v54yq4kexklk0/fuseki-mem-percent-2.png?dl=0


Both of those runs were using Eclipse Temurin on a base Ubuntu jammy
container. Previous runs used AWS Corretto on an AL2 base container.
Behaviour was basically unchanged, which eliminates this being some
Corretto-specific issue or a weird base container OS issue.

Dave

On 03/07/2023 14:54, Andy Seaborne wrote:

Hi Dave,

Could you try 4.7.0?

4.6.0 was 2022-08-20
4.7.0 was 2022-12-27
4.8.0 was 2023-04-20

This is an in-memory database?

Micrometer/Prometheus has had several upgrades but if it is not heap and
not direct memory (I thought that was a hard bound set at start up), I
don't see how it can be involved.

  Andy

On 03/07/2023 14:20, Dave Reynolds wrote:

We have a very strange problem with recent fuseki versions when
running (in docker containers) on small machines. Suspect a jetty
issue but it's not clear.

Wondering if anyone has seen anything like this.

This is a production service but with tiny data (~250k triples, ~60MB
as NQuads). Runs on 4GB machines with java heap allocation of 500MB[1].

We used to run using 3.16 on jdk 8 (AWS Corretto for the long term
support) with no problems.

Switching to fuseki 4.8.0 on jdk 11 the process grows in the space of
a day or so to reach ~3GB of memory at which point the 4GB machine
becomes unviable and things get OOM killed.

The strange thing is that this growth happens when the system is
answering no Sparql queries at all, just regular health ping checks
and (prometheus) metrics scrapes from the monitoring systems.

Furthermore the space being consumed is not visible to any of the JVM
metrics:
- Heap and non-heap are stable at around 100MB total (mostly
non-heap metaspace).
- Mapped buffers stay at 50MB and remain long term stable.
- Direct memory buffers being allocated up to around 500MB then being
reclaimed. Since there are no sparql queries at all we assume this is
jetty NIO buffers being churned as a result of the metric scrapes.
However, this direct buffer behaviour seems stable, it cycles between
0 and 500MB on approx a 10min cycle but is stable over a period of
days and shows no leaks.

Yet the java process grows from an initial 100MB to at least 3GB. This
can occur in the space of a couple of hours or can take up to a day or
two with no predictability in how fast.

Presumably there is some low level JNI space allocated by Jetty (?)
which is invisible to all the JVM metrics and is not being reliably
reclaimed.

Trying 4.6.0, which we've had fewer problems with elsewhere, that seems
to grow to around 1GB (plus up to 0.5GB for the cycling direct memory
buffers) and then stays stable (at least on a three day soak test). 

Re: Mystery memory leak in fuseki

2023-07-04 Thread Martynas Jusevičius
You can profile it in the container as well :)
https://github.com/AtomGraph/fuseki-docker#profiling

On Tue, 4 Jul 2023 at 11.12, Rob @ DNR  wrote:

> Does this only happen in a container?  Or can you reproduce it running
> locally as well?
>
> If you can reproduce it locally then attaching a profiler like VisualVM so
> you can take a heap snapshot and see where the memory is going that would
> be useful
>
> Rob
>
> From: Dave Reynolds 
> Date: Tuesday, 4 July 2023 at 09:31
> To: users@jena.apache.org 
> Subject: Re: Mystery memory leak in fuseki
> Tried 4.7.0 under the most up-to-date java 17 and it acts like 4.8.0. After
> 16 hours it gets to about 1.6GB and by eye has nearly flattened off
> somewhat but not completely.
>
> For interest here's a MEM% curve on a 4GB box (hope the link works).
>
> https://www.dropbox.com/s/xjmluk4o3wlwo0y/fuseki-mem-percent.png?dl=0
>
> The flattish curve from 12:00 to 17:20 is a run using 3.16.0 for
> comparison. The curve from then onwards is 4.7.0.
>
> The spikes on the 4.7.0 match the allocation and recovery of the direct
> memory buffers. The JVM metrics show those cycling around every 10mins
> and being reclaimed each time with no leaking visible at that level.
> Heap, non-heap and mapped buffers are all basically unchanging which is
> to be expected since it's doing nothing apart from reporting metrics.
>
> Whereas this curve (again from 17:20 onwards) shows basically the same
> 4.7.0 set up on a separate host, showing that despite flattening out
> somewhat usage continues to grow - at least on a 16 hour timescale.
>
> https://www.dropbox.com/s/k0v54yq4kexklk0/fuseki-mem-percent-2.png?dl=0
>
>
> Both of those runs were using Eclipse Temurin on a base Ubuntu jammy
> container. Previous runs used AWS Corretto on an AL2 base container.
> Behaviour was basically unchanged, which eliminates this being some
> Corretto-specific issue or a weird base container OS issue.
>
> Dave
>
> On 03/07/2023 14:54, Andy Seaborne wrote:
> > Hi Dave,
> >
> > Could you try 4.7.0?
> >
> > 4.6.0 was 2022-08-20
> > 4.7.0 was 2022-12-27
> > 4.8.0 was 2023-04-20
> >
> > This is an in-memory database?
> >
> > Micrometer/Prometheus has had several upgrades but if it is not heap and
> > not direct memory (I thought that was a hard bound set at start up), I
> > don't see how it can be involved.
> >
> >  Andy
> >
> > On 03/07/2023 14:20, Dave Reynolds wrote:
> >> We have a very strange problem with recent fuseki versions when
> >> running (in docker containers) on small machines. Suspect a jetty
> >> issue but it's not clear.
> >>
> >> Wondering if anyone has seen anything like this.
> >>
> >> This is a production service but with tiny data (~250k triples, ~60MB
> >> as NQuads). Runs on 4GB machines with java heap allocation of 500MB[1].
> >>
> >> We used to run using 3.16 on jdk 8 (AWS Corretto for the long term
> >> support) with no problems.
> >>
> >> Switching to fuseki 4.8.0 on jdk 11 the process grows in the space of
> >> a day or so to reach ~3GB of memory at which point the 4GB machine
> >> becomes unviable and things get OOM killed.
> >>
> >> The strange thing is that this growth happens when the system is
> >> answering no Sparql queries at all, just regular health ping checks
> >> and (prometheus) metrics scrapes from the monitoring systems.
> >>
> >> Furthermore the space being consumed is not visible to any of the JVM
> >> metrics:
> >> - Heap and non-heap are stable at around 100MB total (mostly
> >> non-heap metaspace).
> >> - Mapped buffers stay at 50MB and remain long term stable.
> >> - Direct memory buffers being allocated up to around 500MB then being
> >> reclaimed. Since there are no sparql queries at all we assume this is
> >> jetty NIO buffers being churned as a result of the metric scrapes.
> >> However, this direct buffer behaviour seems stable, it cycles between
> >> 0 and 500MB on approx a 10min cycle but is stable over a period of
> >> days and shows no leaks.
> >>
> >> Yet the java process grows from an initial 100MB to at least 3GB. This
> >> can occur in the space of a couple of hours or can take up to a day or
> >> two with no predictability in how fast.
> >>
> >> Presumably there is some low level JNI space allocated by Jetty (?)
> >> which is invisible to all the JVM metrics and is not being reliably
> >> reclaimed.
> >>
> >> Trying 4.6.0, which we've had fewer problems with elsewhere, that seems
> >> to grow to around 1GB (plus up to 0.5GB for the cycling direct memory
> >> buffers) and then stays stable (at least on a three day soak test).
> >> We could live with allocating 1.5GB to a system that should only need
> >> a few 100MB but concerned that it may not be stable in the really long
> >> term and, in any case, would rather be able to update to more recent
> >> fuseki versions.
> >>
> >> Trying 4.8.0 on java 17 it grows rapidly to around 1GB again but then
> >> keeps ticking up slowly at random intervals. We project that it would
> >> take a few weeks to grow to the scale it did under java 11 but it will
> >> still eventually kill the machine.

Re: Mystery memory leak in fuseki

2023-07-04 Thread Rob @ DNR
Does this only happen in a container?  Or can you reproduce it running locally 
as well?

If you can reproduce it locally then attaching a profiler like VisualVM so you 
can take a heap snapshot and see where the memory is going that would be useful

Rob

From: Dave Reynolds 
Date: Tuesday, 4 July 2023 at 09:31
To: users@jena.apache.org 
Subject: Re: Mystery memory leak in fuseki
Tried 4.7.0 under the most up-to-date java 17 and it acts like 4.8.0. After
16 hours it gets to about 1.6GB and by eye has nearly flattened off
somewhat but not completely.

For interest here's a MEM% curve on a 4GB box (hope the link works).

https://www.dropbox.com/s/xjmluk4o3wlwo0y/fuseki-mem-percent.png?dl=0

The flattish curve from 12:00 to 17:20 is a run using 3.16.0 for
comparison. The curve from then onwards is 4.7.0.

The spikes on the 4.7.0 match the allocation and recovery of the direct
memory buffers. The JVM metrics show those cycling around every 10mins
and being reclaimed each time with no leaking visible at that level.
Heap, non-heap and mapped buffers are all basically unchanging which is
to be expected since it's doing nothing apart from reporting metrics.

Whereas this curve (again from 17:20 onwards) shows basically the same
4.7.0 set up on a separate host, showing that despite flattening out
somewhat usage continues to grow - at least on a 16 hour timescale.

https://www.dropbox.com/s/k0v54yq4kexklk0/fuseki-mem-percent-2.png?dl=0


Both of those runs were using Eclipse Temurin on a base Ubuntu jammy
container. Previous runs used AWS Corretto on an AL2 base container.
Behaviour was basically unchanged, which eliminates this being some
Corretto-specific issue or a weird base container OS issue.

Dave

On 03/07/2023 14:54, Andy Seaborne wrote:
> Hi Dave,
>
> Could you try 4.7.0?
>
> 4.6.0 was 2022-08-20
> 4.7.0 was 2022-12-27
> 4.8.0 was 2023-04-20
>
> This is an in-memory database?
>
> Micrometer/Prometheus has had several upgrades but if it is not heap and
> not direct memory (I thought that was a hard bound set at start up), I
> don't see how it can be involved.
>
>  Andy
>
> On 03/07/2023 14:20, Dave Reynolds wrote:
>> We have a very strange problem with recent fuseki versions when
>> running (in docker containers) on small machines. Suspect a jetty
>> issue but it's not clear.
>>
>> Wondering if anyone has seen anything like this.
>>
>> This is a production service but with tiny data (~250k triples, ~60MB
>> as NQuads). Runs on 4GB machines with java heap allocation of 500MB[1].
>>
>> We used to run using 3.16 on jdk 8 (AWS Corretto for the long term
>> support) with no problems.
>>
>> Switching to fuseki 4.8.0 on jdk 11 the process grows in the space of
>> a day or so to reach ~3GB of memory at which point the 4GB machine
>> becomes unviable and things get OOM killed.
>>
>> The strange thing is that this growth happens when the system is
>> answering no Sparql queries at all, just regular health ping checks
>> and (prometheus) metrics scrapes from the monitoring systems.
>>
>> Furthermore the space being consumed is not visible to any of the JVM
>> metrics:
>> - Heap and non-heap are stable at around 100MB total (mostly
>> non-heap metaspace).
>> - Mapped buffers stay at 50MB and remain long term stable.
>> - Direct memory buffers being allocated up to around 500MB then being
>> reclaimed. Since there are no sparql queries at all we assume this is
>> jetty NIO buffers being churned as a result of the metric scrapes.
>> However, this direct buffer behaviour seems stable, it cycles between
>> 0 and 500MB on approx a 10min cycle but is stable over a period of
>> days and shows no leaks.
>>
>> Yet the java process grows from an initial 100MB to at least 3GB. This
>> can occur in the space of a couple of hours or can take up to a day or
>> two with no predictability in how fast.
>>
>> Presumably there is some low level JNI space allocated by Jetty (?)
>> which is invisible to all the JVM metrics and is not being reliably
>> reclaimed.
>>
>> Trying 4.6.0, which we've had fewer problems with elsewhere, that seems
>> to grow to around 1GB (plus up to 0.5GB for the cycling direct memory
>> buffers) and then stays stable (at least on a three day soak test).
>> We could live with allocating 1.5GB to a system that should only need
>> a few 100MB but concerned that it may not be stable in the really long
>> term and, in any case, would rather be able to update to more recent
>> fuseki versions.
>>
>> Trying 4.8.0 on java 17 it grows rapidly to around 1GB again but then
>> keeps ticking up slowly at random intervals. We project that it would
>> take a few weeks to grow to the scale it did under java 11 but it will
>> still eventually kill the machine.
>>
>> Anyone seen anything remotely like this?
>>
>> Dave
>>
>> [1]  500M heap may be overkill but there can be some complex queries
>> and that should still leave plenty of space for OS buffers etc in the
>> remaining memory on a 4GB machine.
>>
>>
>>


Re: Mystery memory leak in fuseki

2023-07-04 Thread Dave Reynolds
Tried 4.7.0 under the most up-to-date java 17 and it acts like 4.8.0. After 
16 hours it gets to about 1.6GB and by eye has nearly flattened off 
somewhat but not completely.


For interest here's a MEM% curve on a 4GB box (hope the link works).

https://www.dropbox.com/s/xjmluk4o3wlwo0y/fuseki-mem-percent.png?dl=0

The flattish curve from 12:00 to 17:20 is a run using 3.16.0 for 
comparison. The curve from then onwards is 4.7.0.


The spikes on the 4.7.0 match the allocation and recovery of the direct 
memory buffers. The JVM metrics show those cycling around every 10mins 
and being reclaimed each time with no leaking visible at that level. 
Heap, non-heap and mapped buffers are all basically unchanging which is 
to be expected since it's doing nothing apart from reporting metrics.
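
We watch that roughly 10 minute cycle just by polling the Prometheus 
endpoint. A minimal sketch of that kind of poller - the /$/metrics path, the 
port and the jvm_buffer_* metric names are whatever the Micrometer 
Prometheus registry exposes, so treat those as assumptions:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BufferMetricsPoller {
    public static void main(String[] args) throws Exception {
        // Fuseki's Micrometer/Prometheus scrape endpoint (assumed default path and port).
        URI metrics = URI.create("http://localhost:3030/$/metrics");
        HttpClient client = HttpClient.newHttpClient();
        while (true) {
            HttpResponse<String> resp = client.send(
                    HttpRequest.newBuilder(metrics).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            // Keep just the buffer pool gauges, e.g. jvm_buffer_memory_used_bytes{id="direct",...}
            resp.body().lines()
                    .filter(line -> line.startsWith("jvm_buffer"))
                    .forEach(System.out::println);
            Thread.sleep(60_000);   // one sample a minute is plenty for a ~10 min cycle
        }
    }
}

Each scrape is presumably itself part of what churns the Jetty NIO buffers, 
which is consistent with the cycling we see but doesn't explain the native 
growth.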


Whereas this curve (again from 17:20 onwards) shows basically the same 
4.7.0 set up on a separate host, showing that despite flattening out 
somewhat usage continues to grow - at least on a 16 hour timescale.


https://www.dropbox.com/s/k0v54yq4kexklk0/fuseki-mem-percent-2.png?dl=0


Both of those runs were using Eclipse Temurin on a base Ubuntu jammy 
container. Previous runs used AWS Corretto on an AL2 base container. 
Behaviour was basically unchanged, which eliminates this being some 
Corretto-specific issue or a weird base container OS issue.


Dave

On 03/07/2023 14:54, Andy Seaborne wrote:

Hi Dave,

Could you try 4.7.0?

4.6.0 was 2022-08-20
4.7.0 was 2022-12-27
4.8.0 was 2023-04-20

This is an in-memory database?

Micrometer/Prometheus has had several upgrades but if it is not heap and 
not direct memory (I thought that was a hard bound set at start up), I 
don't see how it can be involved.


     Andy

On 03/07/2023 14:20, Dave Reynolds wrote:
We have a very strange problem with recent fuseki versions when 
running (in docker containers) on small machines. Suspect a jetty 
issue but it's not clear.


Wondering if anyone has seen anything like this.

This is a production service but with tiny data (~250k triples, ~60MB 
as NQuads). Runs on 4GB machines with java heap allocation of 500MB[1].


We used to run using 3.16 on jdk 8 (AWS Corretto for the long term 
support) with no problems.


Switching to fuseki 4.8.0 on jdk 11 the process grows in the space of 
a day or so to reach ~3GB of memory at which point the 4GB machine 
becomes unviable and things get OOM killed.


The strange thing is that this growth happens when the system is 
answering no Sparql queries at all, just regular health ping checks 
and (prometheus) metrics scrapes from the monitoring systems.


Furthermore the space being consumed is not visible to any of the JVM 
metrics:
- Heap and non-heap are stable at around 100MB total (mostly 
non-heap metaspace).

- Mapped buffers stay at 50MB and remain long term stable.
- Direct memory buffers being allocated up to around 500MB then being 
reclaimed. Since there are no sparql queries at all we assume this is 
jetty NIO buffers being churned as a result of the metric scrapes. 
However, this direct buffer behaviour seems stable, it cycles between 
0 and 500MB on approx a 10min cycle but is stable over a period of 
days and shows no leaks.


Yet the java process grows from an initial 100MB to at least 3GB. This 
can occur in the space of a couple of hours or can take up to a day or 
two with no predictability in how fast.


Presumably there is some low level JNI space allocated by Jetty (?) 
which is invisible to all the JVM metrics and is not being reliably 
reclaimed.


Trying 4.6.0, which we've had fewer problems with elsewhere, that seems 
to grow to around 1GB (plus up to 0.5GB for the cycling direct memory 
buffers) and then stays stable (at least on a three day soak test).  
We could live with allocating 1.5GB to a system that should only need 
a few 100MB but concerned that it may not be stable in the really long 
term and, in any case, would rather be able to update to more recent 
fuseki versions.


Trying 4.8.0 on java 17 it grows rapidly to around 1GB again but then 
keeps ticking up slowly at random intervals. We project that it would 
take a few weeks to grow to the scale it did under java 11 but it will 
still eventually kill the machine.


Anyone seen anything remotely like this?

Dave

[1]  500M heap may be overkill but there can be some complex queries 
and that should still leave plenty of space for OS buffers etc in the 
remaining memory on a 4GB machine.






Re: Mystery memory leak in fuseki

2023-07-04 Thread Dave Reynolds

Thanks for the suggestion, that could be useful.

Not managed to make that work yet. From within the container we get 
permission denied, and running it on the host is no use because the 
relevant .so files aren't where ltrace expects and it crashes out.


Similarly strace can't attach to the process in the container and 
running on the host gives no info.


I guess we would have to replicate the set-up without using containers. 
Certainly possible, but it's a fair amount of work and we'd lose all the 
metrics we get from the container stack. May have to resort to that.
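
In the meantime, one low-tech check we can run without ptrace is to compare 
what the kernel says the process is using against what the JVM itself 
accounts for - essentially the same comparison as the MEM% curves vs the 
JVM metrics, but from inside the process. A rough sketch of that, assuming 
it runs inside the Fuseki JVM (the proc parsing and MB arithmetic are just 
illustrative):

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

public class RssVsJvm {
    public static void main(String[] args) throws Exception {
        // Resident set size: what the kernel charges the whole process with (in kB).
        long rssKb = Files.readAllLines(Path.of("/proc/self/status")).stream()
                .filter(l -> l.startsWith("VmRSS:"))
                .mapToLong(l -> Long.parseLong(l.replaceAll("\\D+", "")))
                .findFirst().orElse(0);
        // What the JVM itself accounts for: heap + non-heap + buffer pools (in bytes).
        long jvmBytes = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getCommitted()
                + ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage().getCommitted();
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            jvmBytes += pool.getMemoryUsed();
        }
        // A large and growing gap is the "invisible" native memory we're chasing.
        System.out.printf("rss=%dMB jvm-accounted=%dMB gap=%dMB%n",
                rssKb >> 10, jvmBytes >> 20, (rssKb >> 10) - (jvmBytes >> 20));
    }
}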


Dave

On 03/07/2023 22:22, Justin wrote:

You might try running `ltrace` to watch the library calls and system calls
the jvm is making.
e.g.
ltrace -S -f -p 

I think the `sbrk` system call is used to allocate memory. It might be
interesting to see if you can catch the jvm invoking that system call and
also see what is happening around it.

On Mon, Jul 3, 2023 at 10:50 AM Dave Reynolds 
wrote:


On 03/07/2023 14:36, Martynas Jusevičius wrote:

There have been a few similar threads:

https://www.mail-archive.com/users@jena.apache.org/msg19871.html

https://www.mail-archive.com/users@jena.apache.org/msg18825.html



Thanks, I've seen those and not sure they quite match our case but maybe
I'm mistaken.

We already have a smallish heap allocation (500MB), which seems to be a
key conclusion of both those threads. Though I guess we could try even
lower.

Furthermore the second thread was related to 3.16.0 which is completely
stable for us at 150MB (rather than the 1.5GB that 4.6.* gets to, let
alone the 3+GB that gets 4.8.0 killed).

Dave




On Mon, 3 Jul 2023 at 15.20, Dave Reynolds 
wrote:


We have a very strange problem with recent fuseki versions when running
(in docker containers) on small machines. Suspect a jetty issue but it's
not clear.

Wondering if anyone has seen anything like this.

This is a production service but with tiny data (~250k triples, ~60MB as
NQuads). Runs on 4GB machines with java heap allocation of 500MB[1].

We used to run using 3.16 on jdk 8 (AWS Corretto for the long term
support) with no problems.

Switching to fuseki 4.8.0 on jdk 11 the process grows in the space of a
day or so to reach ~3GB of memory at which point the 4GB machine becomes
unviable and things get OOM killed.

The strange thing is that this growth happens when the system is
answering no Sparql queries at all, just regular health ping checks and
(prometheus) metrics scrapes from the monitoring systems.

Furthermore the space being consumed is not visible to any of the JVM
metrics:
- Heap and non-heap are stable at around 100MB total (mostly
non-heap metaspace).
- Mapped buffers stay at 50MB and remain long term stable.
- Direct memory buffers being allocated up to around 500MB then being
reclaimed. Since there are no sparql queries at all we assume this is
jetty NIO buffers being churned as a result of the metric scrapes.
However, this direct buffer behaviour seems stable, it cycles between 0
and 500MB on approx a 10min cycle but is stable over a period of days
and shows no leaks.

Yet the java process grows from an initial 100MB to at least 3GB. This
can occur in the space of a couple of hours or can take up to a day or
two with no predictability in how fast.

Presumably there is some low level JNI space allocated by Jetty (?)
which is invisible to all the JVM metrics and is not being reliably
reclaimed.

Trying 4.6.0, which we've had fewer problems with elsewhere, that seems
to grow to around 1GB (plus up to 0.5GB for the cycling direct memory
buffers) and then stays stable (at least on a three day soak test).  We
could live with allocating 1.5GB to a system that should only need a few
100MB but concerned that it may not be stable in the really long term
and, in any case, would rather be able to update to more recent fuseki
versions.

Trying 4.8.0 on java 17 it grows rapidly to around 1GB again but then
keeps ticking up slowly at random intervals. We project that it would
take a few weeks to grow to the scale it did under java 11 but it will
still eventually kill the machine.

Anyone seen anything remotely like this?

Dave

[1]  500M heap may be overkill but there can be some complex queries and
that should still leave plenty of space for OS buffers etc in the
remaining memory on a 4GB machine.