There is one more variable in this picture: Java's container awareness:
https://developers.redhat.com/articles/2022/04/19/java-17-whats-new-openjdks-container-awareness
Whether it has an impact in this case, I have no idea :)
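For what it's worth, a quick way to see what the JVM believes it has been
given inside a container is just to ask it - a minimal sketch, plain JDK
only, nothing Fuseki-specific:

    import java.lang.management.ManagementFactory;

    public class ContainerLimits {
        public static void main(String[] args) {
            // With container awareness these reflect the cgroup limits,
            // not the host's physical resources.
            Runtime rt = Runtime.getRuntime();
            System.out.println("Available processors: " + rt.availableProcessors());
            System.out.println("Max heap (bytes):     " + rt.maxMemory());
            // Non-heap (metaspace etc.) usage for comparison
            System.out.println("Non-heap used:        "
                    + ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage().getUsed());
        }
    }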
On Thu, 20 Jul 2023 at 11:48, Conal McLaughlin <conal.mclaugh...@iotics.com> wrote:
> Hey Andy,
>
> Metaspace seems to be stable!
>
> We're running this on Java 11 currently.
> I can check it out with Java 17 though.
>
> We've currently set Xms/Xmx to 2560MB & MaxMetaspaceSize to 256MB.
> The ECS task is set with a ceiling of 4GB memory & 1 vCPU.
>
> Could it be more of a race condition than the size of used objects, due to logging?
> I do see some time-sensitive eviction code in Jetty -
> https://github.com/eclipse/jetty.project/blob/9e16d81cf8922c75e3d2d96c66442b896a9c69e1/jetty-io/src/main/java/org/eclipse/jetty/io/ArrayRetainableByteBufferPool.java#L374
> Not sure if the same type of thing exists in the Jena codebase.
>
> I will try to check with `--empty` also.
>
> Thanks,
> Conal
>
> On 2023/07/19 21:10:24 Andy Seaborne wrote:
> > Conal,
> >
> > Thanks for the information.
> > Can you see if metaspace is growing as well?
> >
> > All,
> >
> > Could someone please try running Fuseki main, with no datasets (--empty), with some healthcheck ping traffic.
> >
> > Andy
> >
> > On 19/07/2023 14:42, Conal McLaughlin wrote:
> > > Hey Dave,
> > >
> > > Thank you for providing an in-depth analysis of your issues.
> > > We appear to be witnessing the same type of problems with our current Fuseki deployment.
> > > We are deploying a containerised Fuseki into an AWS ECS task alongside other containers - this may not be ideal but that's a different story.
> > >
> > > I just wanted to add another data point to everything you have described.
> > > Firstly, it does seem like "idle" (or very low traffic) instances are the problem for us (coupled with a larger heap than necessary).
> > > We witness the same increase in the ECS task memory consumption up until the whole thing, including the Fuseki container, is killed off.
> > >
> > > In an attempt to see what was going on beneath the hood, we turned up the logging to TRACE in the log4j2.xml file provided to Fuseki.
> > > This appeared to stabilise the increasing memory consumption.
> > > Even just switching `logger.jetty.level` to TRACE alleviates the issue.
> >
> > Colour me confused!
> >
> > A Log4j logger that is active will use a few objects - maybe that's enough to trigger a minor GC, which in turn is enough to flush some non-heap resources.
> >
> > How big is the heap?
> > This is Java 17?
> >
> > > We are testing this on Fuseki 4.8.0/TDB2 with close to 0 triples and extremely low query traffic / health checks via /ping.
> > > [ecs-task-memory graph: https://pasteboard.co/KPk7uhH2F9Lp.png]
> > >
> > > Cheers,
> > > Conal
> > >
> > > On 2023/07/11 09:31:25 Dave Reynolds wrote:
> > > > Hi Rob,
> > > >
> > > > Good point. Will try to find time to experiment with that, but given the testing cycle time that will take a while and can't start immediately.
> > > >
> > > > I'm a little sceptical though. As mentioned before, all the metrics we see show the direct memory pool that Jetty uses cycling up to the max heap size and then being collected, but with no long-term growth to match the process size growth. This really feels more like a bug (though not sure where) than tuning. The fact that actual behaviour doesn't match the documentation isn't encouraging.
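[Aside: the direct/mapped figures being quoted here come from the JDK's
standard BufferPoolMXBeans (I believe the Prometheus scrape reports the
same beans). A minimal sketch for watching the cycling outside of the
metrics stack - class and interval are illustrative only:]

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;

    public class BufferPoolWatch {
        public static void main(String[] args) throws InterruptedException {
            // Samples the "direct" and "mapped" buffer pools once a minute.
            while (true) {
                for (BufferPoolMXBean pool :
                        ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
                    System.out.printf("%s: count=%d used=%d capacity=%d%n",
                            pool.getName(), pool.getCount(),
                            pool.getMemoryUsed(), pool.getTotalCapacity());
                }
                Thread.sleep(60_000);
            }
        }
    }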
> > > > It's also pretty hard to figure out what the right pool configuration would be. This thing is just being asked to deliver a few metrics (12KB per request) several times a minute but manages to eat 500MB of direct buffer space every 5 mins. So what the right pool parameters are to support real usage peaks is not going to be easy to figure out.
> > > >
> > > > Nonetheless you are right. That's something that should be explored.
> > > >
> > > > Dave
> > > >
> > > > On 11/07/2023 09:45, Rob @ DNR wrote:
> > > > > Dave
> > > > >
> > > > > Thanks for the further information.
> > > > >
> > > > > Have you experimented with using Jetty 10 but providing more detailed configuration? Fuseki supports providing detailed Jetty configuration if needed via the --jetty-config option.
> > > > >
> > > > > The following section looks relevant:
> > > > >
> > > > > https://eclipse.dev/jetty/documentation/jetty-10/operations-guide/index.html#og-module-bytebufferpool
> > > > >
> > > > > It looks like the default is that Jetty uses a heuristic to determine these values; sadly the heuristic in question is not detailed in that documentation.
> > > > >
> > > > > Best guess from digging through their code is that the "heuristic" is this:
> > > > >
> > > > > https://github.com/eclipse/jetty.project/blob/jetty-10.0.x/jetty-io/src/main/java/org/eclipse/jetty/io/AbstractByteBufferPool.java#L78-L84
> > > > >
> > > > > i.e., ¼ of the configured max heap size. This doesn't necessarily align with the exact sizes of process growth you see, but I note the documentation does explicitly say that buffers used can go beyond these limits - those will just be GC'd rather than pooled for reuse.
> > > > >
> > > > > Example byte buffer configuration at:
> > > > > https://github.com/eclipse/jetty.project/blob/9a05c75ad28ebad4abbe624fa432664c59763747/jetty-server/src/main/config/etc/jetty-bytebufferpool.xml#L4
> > > > >
> > > > > Any chance you could try customising this for your needs with stock Fuseki and see if this allows you to make the process size smaller and sufficiently predictable for your use case?
> > > > >
> > > > > Rob
> > > > >
> > > > > From: Dave Reynolds <da...@gmail.com>
> > > > > Date: Tuesday, 11 July 2023 at 08:58
> > > > > To: users@jena.apache.org <us...@jena.apache.org>
> > > > > Subject: Re: Mystery memory leak in fuseki
> > > > > For interest[*] ...
> > > > >
> > > > > This is what the core JVM metrics look like when transitioning from a Jetty 10 to a Jetty 9.4 instance. You can see the direct buffers cycling up to 500MB (which happens to be the max heap setting) on Jetty 10, nothing on Jetty 9. The drop in mapped buffers is just because TDB hadn't been asked any queries yet.
> > > > >
> > > > > https://www.dropbox.com/scl/fi/9afhrztbb36fvzqkuw996/fuseki-jetty10-jetty9-transition.png?rlkey=7fpj4x1pn5mjnf3jjwenmp65m&dl=0
> > > > >
> > > > > Here are the same metrics around the time of triggering a TDB backup. Shows the mapped buffer use for TDB but no significant impact on heap etc.
> > > > >
> > > > > https://www.dropbox.com/scl/fi/0s40vpizf94c4w3m2awna/fuseki-jetty10-backup.png?rlkey=ai31m6z58w0uex8zix8e9ctna&dl=0
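[Aside: for anyone wanting to reproduce the backup test, the backup can be
triggered through the Fuseki admin protocol - a sketch, assuming the admin
endpoints are enabled and a dataset named /ds (the name here is made up):]

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class TriggerBackup {
        public static void main(String[] args) throws Exception {
            // POST /$/backup/{name} starts an async backup of dataset {name};
            // the JSON response carries a task id that can be polled via /$/tasks.
            HttpRequest req = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:3030/$/backup/ds"))
                    .POST(HttpRequest.BodyPublishers.noBody())
                    .build();
            HttpResponse<String> resp = HttpClient.newHttpClient()
                    .send(req, HttpResponse.BodyHandlers.ofString());
            System.out.println(resp.statusCode() + " " + resp.body());
        }
    }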
> > > > > These are all on the same instance as the RES memory trace:
> > > > >
> > > > > https://www.dropbox.com/scl/fi/c58nqkr2hi193a84btedg/fuseki-4.9.0-jetty-9.4.png?rlkey=b7osnj6k1oy1xskl4j25zz6o8&dl=0
> > > > >
> > > > > Dave
> > > > >
> > > > > [*] I've been staring at metric graphs for so many days I may have a distorted notion of what's interesting :)
> > > > >
> > > > > On 11/07/2023 08:39, Dave Reynolds wrote:
> > > > >> After a 10 hour test of 4.9.0 with Jetty 9.4 on java 17 in the production, containerized, environment, it is indeed very stable.
> > > > >>
> > > > >> Running at less than 6% of memory on a 4GB machine, compared to peaks of ~50% for versions with Jetty 10. RES shows as 240K with 35K shared (presume mostly libraries).
> > > > >>
> > > > >> Copy of trace is:
> > > > >> https://www.dropbox.com/scl/fi/c58nqkr2hi193a84btedg/fuseki-4.9.0-jetty-9.4.png?rlkey=b7osnj6k1oy1xskl4j25zz6o8&dl=0
> > > > >>
> > > > >> The high spikes on the left of the image are the prior run with out-of-the-box 4.7.0 on the same JVM.
> > > > >>
> > > > >> The small spike at 06:00 is a dump, so TDB was able to touch and scan all the (modest) data with a very minor blip in resident size (as you'd hope). JVM stats show the mapped buffers for TDB jumping up but confirm heap is stable at < 60M, non-heap 60M.
> > > > >>
> > > > >> Dave
> > > > >>
> > > > >> On 10/07/2023 20:52, Dave Reynolds wrote:
> > > > >>> Since this thread has got complex, I'm posting this update here at the top level.
> > > > >>>
> > > > >>> Thanks to folks, especially Andy and Rob, for suggestions and for investigating.
> > > > >>>
> > > > >>> After a lot more testing at our end I believe we now have some workarounds.
> > > > >>>
> > > > >>> First, at least on java 17, the process growth does seem to level out. Despite what I just said to Rob, having just checked our soak tests, a jena 4.7.0/java 17 test with 500MB max heap has lasted for 7 days. Process size oscillates between 1.5GB and 2GB but hasn't gone above that in a week. The oscillation is almost entirely the cycling of the direct memory buffers used by Jetty. Empirically those cycle up to something comparable to the set max heap size, at least for us.
> > > > >>>
> > > > >>> While this week-long test was 4.7.0, based on earlier tests I suspect 4.8.0 (and now 4.9.0) would also level out, at least on a timescale of days.
> > > > >>>
> > > > >>> The key has been setting the max heap low. At 2GB and even 1GB (the default on a 4GB machine) we see higher peak levels of direct buffers, and overall process size grew to around 3GB, at which point the container is killed on the small machines. Though java 17 does seem to be better behaved than java 11, so switching to that probably also helped.
> > > > >>>
> > > > >>> Given the actual heap is low (50MB heap, 60MB non-heap), needing 2GB to run in feels high but is workable. So my previously suggested rule of thumb (in this low memory regime, allow 4x the max heap size) seems to work.
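[Aside: if I remember rightly, when -XX:MaxDirectMemorySize is not set the
JDK caps direct buffer allocation at roughly the max heap size, which would
fit the observation that the buffers cycle up to ~Xmx - treat that as an
assumption to verify rather than fact. A sketch for checking the effective
setting:]

    import java.lang.management.ManagementFactory;
    import com.sun.management.HotSpotDiagnosticMXBean;

    public class DirectMemoryCap {
        public static void main(String[] args) {
            HotSpotDiagnosticMXBean hotspot =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            // "0" means unset, in which case the runtime falls back to
            // (approximately) the max heap size as the direct buffer cap.
            System.out.println("MaxDirectMemorySize = "
                    + hotspot.getVMOption("MaxDirectMemorySize").getValue());
            System.out.println("Max heap            = "
                    + Runtime.getRuntime().maxMemory());
        }
    }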
> > > > >>> Second, we're now pretty confident the issue is jetty 10+.
> > > > >>>
> > > > >>> We've built a fuseki-server 4.9.0 with Jetty replaced by version 9.4.51.v20230217. This required some minor source changes to compile and pass tests. On a local bare metal test where we saw process growth up to 1.5-2GB, this build has run stably using less than 500MB for 4 hours.
> > > > >>>
> > > > >>> We'll set a longer term test running in the target containerized environment to confirm things, but we're quite hopeful this will be long term stable.
> > > > >>>
> > > > >>> I realise Jetty 9.4.x is out of community support but eclipse say EOL is "unlikely to happen before 2025". So, while this may not be a solution for the Jena project, it could give us a workaround at the cost of doing custom builds.
> > > > >>>
> > > > >>> Dave
> > > > >>>
> > > > >>> On 03/07/2023 14:20, Dave Reynolds wrote:
> > > > >>>> We have a very strange problem with recent fuseki versions when running (in docker containers) on small machines. Suspect a jetty issue but it's not clear.
> > > > >>>>
> > > > >>>> Wondering if anyone has seen anything like this.
> > > > >>>>
> > > > >>>> This is a production service but with tiny data (~250k triples, ~60MB as NQuads). Runs on 4GB machines with a java heap allocation of 500MB[1].
> > > > >>>>
> > > > >>>> We used to run using 3.16 on jdk 8 (AWS Corretto for the long term support) with no problems.
> > > > >>>>
> > > > >>>> Switching to fuseki 4.8.0 on jdk 11, the process grows in the space of a day or so to reach ~3GB of memory, at which point the 4GB machine becomes unviable and things get OOM killed.
> > > > >>>>
> > > > >>>> The strange thing is that this growth happens when the system is answering no Sparql queries at all, just regular health ping checks and (prometheus) metrics scrapes from the monitoring systems.
> > > > >>>>
> > > > >>>> Furthermore, the space being consumed is not visible to any of the JVM metrics:
> > > > >>>> - Heap and non-heap are stable at around 100MB total (mostly non-heap metaspace).
> > > > >>>> - Mapped buffers stay at 50MB and remain long term stable.
> > > > >>>> - Direct memory buffers are allocated up to around 500MB then reclaimed. Since there are no sparql queries at all, we assume this is jetty NIO buffers being churned as a result of the metric scrapes. However, this direct buffer behaviour seems stable: it cycles between 0 and 500MB on approx a 10 min cycle, but is stable over a period of days and shows no leaks.
> > > > >>>>
> > > > >>>> Yet the java process grows from an initial 100MB to at least 3GB. This can occur in the space of a couple of hours or can take up to a day or two, with no predictability in how fast.
> > > > >>>>
> > > > >>>> Presumably there is some low level JNI space allocated by Jetty (?) which is invisible to all the JVM metrics and is not being reliably reclaimed.
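[Aside: the usual way to chase native allocations the JVM metrics can't see
is -XX:NativeMemoryTracking=summary plus "jcmd <pid> VM.native_memory summary",
though NMT won't catch allocations made outside the JVM's own allocator either.
A sketch that totals everything the JVM will admit to, for comparison against
the RES figure:]

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;

    public class JvmVisibleMemory {
        public static void main(String[] args) {
            MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
            long total = mem.getHeapMemoryUsage().getCommitted()
                       + mem.getNonHeapMemoryUsage().getCommitted();
            for (BufferPoolMXBean pool :
                    ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
                total += pool.getMemoryUsed(); // the "direct" and "mapped" pools
            }
            // Whatever RES shows beyond (roughly) this total is native memory
            // that none of the JVM metrics account for.
            System.out.println("JVM-visible (bytes): " + total);
        }
    }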
> > > > >>>> Trying 4.6.0, which we've had fewer problems with elsewhere: that seems to grow to around 1GB (plus up to 0.5GB for the cycling direct memory buffers) and then stays stable (at least on a three day soak test). We could live with allocating 1.5GB to a system that should only need a few 100MB, but we're concerned that it may not be stable in the really long term and, in any case, would rather be able to update to more recent fuseki versions.
> > > > >>>>
> > > > >>>> Trying 4.8.0 on java 17, it grows rapidly to around 1GB again but then keeps ticking up slowly at random intervals. We project that it would take a few weeks to grow to the scale it did under java 11, but it will still eventually kill the machine.
> > > > >>>>
> > > > >>>> Anyone seen anything remotely like this?
> > > > >>>>
> > > > >>>> Dave
> > > > >>>>
> > > > >>>> [1] 500M heap may be overkill but there can be some complex queries, and that should still leave plenty of space for OS buffers etc. in the remaining memory on a 4GB machine.