Thanks for the suggestion, that could be useful.
I've not managed to make that work yet. From within the container I get
permission denied, and running it on the host is no use because the
relevant .so files aren't where ltrace expects, so it crashes out.
Similarly, strace can't attach to the process in the container, and
running it on the host gives no information.
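(One thing we could still try, assuming the stack is plain Docker
underneath: the permission denied from inside the container is
presumably ptrace being blocked, so restarting the container with
something like

docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined ...

might let strace/ltrace attach from inside. Not verified on our setup
yet.)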
Failing that, I guess we'd have to replicate the setup without using
containers. That's certainly possible, but it's a fair amount of work
and loses all the metrics we get from the container stack. We may have
to resort to that.
Dave
On 03/07/2023 22:22, Justin wrote:
You might try running `ltrace` to watch the library calls and system calls
the jvm is making.
e.g.
ltrace -S -f -p <your fuseki PID here>
I think the `sbrk` system call is used to allocate memory. It might be
interesting to see if you can catch the jvm invoking that system call and
also see what is happening around it.
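If attaching ltrace proves awkward, strace alone can give a similar
view, assuming a reasonably recent strace build, e.g.

strace -f -e trace=memory -p <your fuseki PID here>

which limits the output to memory-management calls such as brk, mmap
and munmap.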
On Mon, Jul 3, 2023 at 10:50 AM Dave Reynolds <[email protected]>
wrote:
On 03/07/2023 14:36, Martynas Jusevičius wrote:
There have been a few similar threads:
https://www.mail-archive.com/[email protected]/msg19871.html
https://www.mail-archive.com/[email protected]/msg18825.html
Thanks, I've seen those; I'm not sure they quite match our case, but
maybe I'm mistaken.
We already have a smallish heap allocation (500MB), which seems to be a
key conclusion of both those threads, though I guess we could try going
even lower.
Furthermore, the second thread was related to 3.16.0, which is
completely stable for us at 150MB (rather than the 1.5GB that 4.6.*
reaches, let alone the 3+GB that gets 4.8.0 killed).
Dave
On Mon, 3 Jul 2023 at 15:20, Dave Reynolds <[email protected]>
wrote:
We have a very strange problem with recent fuseki versions when running
(in Docker containers) on small machines. We suspect a Jetty issue, but
it's not clear.
Wondering if anyone has seen anything like this.
This is a production service but with tiny data (~250k triples, ~60MB as
NQuads). Runs on 4GB machines with java heap allocation of 500MB[1].
We used to run 3.16 on JDK 8 (AWS Corretto, for the long-term support)
with no problems.
Switching to fuseki 4.8.0 on JDK 11, the process grows in the space of a
day or so to reach ~3GB of memory, at which point the 4GB machine
becomes unviable and things get OOM-killed.
The strange thing is that this growth happens while the system is
answering no SPARQL queries at all, just regular health ping checks and
(Prometheus) metrics scrapes from the monitoring systems.
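(For concreteness, the only traffic is hits on the standard Fuseki ping
and Prometheus endpoints, i.e. roughly

curl http://localhost:3030/$/ping
curl http://localhost:3030/$/metrics

with the exact host and port specific to our deployment.)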
Furthermore the space being consumed is not visible to any of the JVM
metrics:
- Heap and non-heap are stable at around 100MB total (mostly non-heap
metaspace).
- Mapped buffers stay at 50MB and remain stable long term.
- Direct memory buffers are allocated up to around 500MB and then
reclaimed. Since there are no SPARQL queries at all, we assume this is
Jetty NIO buffers being churned as a result of the metrics scrapes.
However, this direct buffer behaviour seems stable: it cycles between 0
and 500MB on roughly a 10-minute cycle, but is stable over a period of
days and shows no leaks.
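(As an aside, if we need to rule the NIO buffers in or out, one
experiment would be capping direct memory, e.g. adding

-XX:MaxDirectMemorySize=256m

to the JVM options; if the process still grows well past the cap, the
growth must be coming from somewhere else. We haven't tried this yet.)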
Yet the java process grows from an initial 100MB to at least 3GB. This
can occur in the space of a couple of hours or can take up to a day or
two with no predictability in how fast.
Presumably there is some low level JNI space allocated by Jetty (?)
which is invisible to all the JVM metrics and is not being reliably
reclaimed.
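(One way we might test that is native memory tracking, assuming we can
restart the JVM with the extra flag: start it with

-XX:NativeMemoryTracking=summary

and then periodically run

jcmd <fuseki PID> VM.native_memory summary

Though NMT only covers allocations made through the JVM itself, so
memory malloc'd directly by a native library could still be invisible
to it.)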
Trying 4.6.0, which we've had fewer problems with elsewhere, it seems
to grow to around 1GB (plus up to 0.5GB for the cycling direct memory
buffers) and then stays stable (at least on a three-day soak test). We
could live with allocating 1.5GB to a system that should only need a few
hundred MB, but we're concerned that it may not be stable in the really
long term and, in any case, we would rather be able to update to more
recent fuseki versions.
Trying 4.8.0 on Java 17, it grows rapidly to around 1GB again but then
keeps ticking up slowly at random intervals. We project that it would
take a few weeks to grow to the scale it did under Java 11, but it will
still eventually kill the machine.
Anyone seen anything remotely like this?
Dave
[1] A 500MB heap may be overkill, but there can be some complex queries,
and that should still leave plenty of space for OS buffers etc. in the
remaining memory on a 4GB machine.
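For reference, the heap limit is set with the usual JVM option, i.e. an
invocation along the lines of

java -Xmx500m -jar fuseki-server.jar ...

with the exact flags and paths wrapped up in our container entrypoint.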