Conal,

Thanks for the information.
Can you see if metaspace is growing as well?

All,

Could someone please try running Fuseki main with no datasets (--empty) and some healthcheck ping traffic?
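
For example, something roughly like this would do - an untested sketch using the embedded server API rather than the command line; the class name, port and interval are arbitrary, and it assumes jena-fuseki-main is on the classpath:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import org.apache.jena.fuseki.main.FusekiServer;

public class EmptyPingSoak {
    public static void main(String[] args) throws Exception {
        // No datasets registered; only the /$/ping endpoint is exposed.
        FusekiServer server = FusekiServer.create()
                .port(3030)
                .enablePing(true)
                .build();
        server.start();

        // Stand-in for health-check traffic against /$/ping.
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest ping = HttpRequest.newBuilder(
                URI.create("http://localhost:3030/$/ping")).build();
        while (true) {
            HttpResponse<String> r = client.send(ping, HttpResponse.BodyHandlers.ofString());
            System.out.println("ping -> " + r.statusCode());
            Thread.sleep(30_000);
        }
    }
}

Then watch the process RES size and the JVM buffer pools over a few hours.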

    Andy

On 19/07/2023 14:42, Conal McLaughlin wrote:
Hey Dave,

Thank you for providing an in-depth analysis of your issues.
We appear to be witnessing the same type of problems with our current Fuseki deployment. We are deploying a containerised Fuseki into an AWS ECS task alongside other containers - this may not be ideal but that’s a different story.

I just wanted to add another data point to everything you have described.
Firstly, it does seem like “idle” (or very low traffic) instances are the problem for us (coupled with a larger heap than necessary). We witness the same increase in ECS task memory consumption up until the whole thing, including the Fuseki container, is killed off.

In an attempt to see what was going on beneath the hood, we turned up the logging to TRACE in the log4j2.xml file provided to Fuseki.
This appeared to stabilise the increasing memory consumption.
Even just switching the `logger.jetty.level` to TRACE alleviates the issue.
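
For what it's worth, the same switch can also be flipped programmatically at startup rather than by editing the config file - an untested sketch, assuming log4j-core is on the classpath and that `logger.jetty` maps to the org.eclipse.jetty package:

import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.core.config.Configurator;

public class JettyTraceWorkaround {
    public static void main(String[] args) {
        // Assumed equivalent of setting logger.jetty.level to TRACE in the
        // Log4j2 config: raise the org.eclipse.jetty loggers to TRACE at runtime.
        Configurator.setLevel("org.eclipse.jetty", Level.TRACE);
        // ... start Fuseki / the rest of the application here ...
    }
}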

Colour me confused!

A Log4j logger that is active will use a few objects - maybe that's enough to trigger a minor GC which in turn is enough to flush some non-heap resources.

How big is the heap?
This is Java17?

We are testing this on Fuseki 4.8.0/TDB2 with close to 0 triples and extremely low query traffic / health checks via /ping.
ecs-task-memory graph: <https://pasteboard.co/KPk7uhH2F9Lp.png>


Cheers,
Conal

On 2023/07/11 09:31:25 Dave Reynolds wrote:
 > Hi Rob,
 >
 > Good point. Will try to find time to experiment with that but given the
 > testing cycle time that will take a while and can't start immediately.
 >
 > I'm a little sceptical though. As mentioned before, all the metrics we
 > see show the direct memory pool that Jetty uses cycling up to the max heap
 > size and then being collected but with no long term growth to match the
 > process size growth. This really feels more like a bug (though not sure
 > where) than tuning. The fact that actual behaviour doesn't match the
 > documentation isn't encouraging.
 >
 > It's also pretty hard to figure out what the right pool configuration would
 > be. This thing is just being asked to deliver a few metrics (12KB per
 > request) several times a minute but manages to eat 500MB of direct
 > buffer space every 5mins. So what the right pool parameters are to
 > support real usage peaks is not going to be easy to figure out.
 >
 > None the less you are right. That's something that should be explored.
 >
 > Dave
 >
 >
 > On 11/07/2023 09:45, Rob @ DNR wrote:
 > > Dave
 > >
 > > Thanks for the further information.
 > >
> > Have you experimented with using Jetty 10 but providing more detailed configuration? Fuseki supports providing detailed Jetty configuration if needed via the --jetty-config option.
 > >
 > > The following section looks relevant:
 > >
> > https://eclipse.dev/jetty/documentation/jetty-10/operations-guide/index.html#og-module-bytebufferpool
 > >
> > It looks like the default is that Jetty uses a heuristic to determine these values; sadly, the heuristic in question is not detailed in that documentation.
 > >
> > Best guess from digging through their code is that the “heuristic” is this:
 > >
> > https://github.com/eclipse/jetty.project/blob/jetty-10.0.x/jetty-io/src/main/java/org/eclipse/jetty/io/AbstractByteBufferPool.java#L78-L84
 > >
> > i.e., ¼ of the configured max heap size. This doesn’t necessarily align with the exact sizes of process growth you see, but I note the documentation does explicitly say that buffers used can go beyond these limits and that those will just be GC’d rather than pooled for reuse.
 > >
> > Example byte buffer configuration at https://github.com/eclipse/jetty.project/blob/9a05c75ad28ebad4abbe624fa432664c59763747/jetty-server/src/main/config/etc/jetty-bytebufferpool.xml#L4
 > >
> > Any chance you could try customising this for your needs with stock Fuseki and see if this allows you to make the process size smaller and sufficiently predictable for your use case?
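 > >
 > > For illustration only - the settings in that XML map onto the ArrayByteBufferPool constructor
 > > arguments, roughly as below (untested, written from memory against Jetty 10, values invented
 > > purely to show the shape; worth checking the javadoc before relying on it):
 > >
 > > import org.eclipse.jetty.io.ArrayByteBufferPool;
 > >
 > > public class BufferPoolLimits {
 > >     public static void main(String[] args) {
 > >         // Cap pooled heap and direct memory explicitly instead of relying on
 > >         // the maxMemory()/4 heuristic linked above. All numbers are made up.
 > >         ArrayByteBufferPool pool = new ArrayByteBufferPool(
 > >                 0,                 // minCapacity
 > >                 1024,              // factor (bucket size increment)
 > >                 64 * 1024,         // maxCapacity of a pooled buffer
 > >                 -1,                // maxQueueLength (use the default)
 > >                 64 * 1024 * 1024,  // maxHeapMemory: 64MB of pooled heap buffers
 > >                 64 * 1024 * 1024); // maxDirectMemory: 64MB of pooled direct buffers
 > >         System.out.println(pool);
 > >     }
 > > }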
 > >
 > > Rob
 > >
 > > From: Dave Reynolds <da...@gmail.com>
 > > Date: Tuesday, 11 July 2023 at 08:58
 > > To: users@jena.apache.org <us...@jena.apache.org>
 > > Subject: Re: Mystery memory leak in fuseki
 > > For interest[*] ...
 > >
 > > This is what the core JVM metrics look like when transitioning from a
 > > Jetty10 to a Jetty9.4 instance. You can see the direct buffer cycling up
 > > to 500MB (which happens to be the max heap setting) on Jetty 10, nothing
 > > on Jetty 9. The drop in Mapped buffers is just because TDB hadn't been
 > > asked any queries yet.
 > >
> > https://www.dropbox.com/scl/fi/9afhrztbb36fvzqkuw996/fuseki-jetty10-jetty9-transition.png?rlkey=7fpj4x1pn5mjnf3jjwenmp65m&dl=0
 > >
 > > Here are the same metrics around the time of triggering a TDB backup. These show
 > > the mapped buffer use for TDB but no significant impact on heap etc.
 > >
> > https://www.dropbox.com/scl/fi/0s40vpizf94c4w3m2awna/fuseki-jetty10-backup.png?rlkey=ai31m6z58w0uex8zix8e9ctna&dl=0
 > >
 > > These are all on the same instance as the RES memory trace:
 > >
> > https://www.dropbox.com/scl/fi/c58nqkr2hi193a84btedg/fuseki-4.9.0-jetty-9.4.png?rlkey=b7osnj6k1oy1xskl4j25zz6o8&dl=0
 > >
 > > Dave
 > >
 > > [*] I've been staring at metric graphs for so many days I may have a
 > > distorted notion of what's interesting :)
 > >
 > > On 11/07/2023 08:39, Dave Reynolds wrote:
 > >> After a 10 hour test of 4.9.0 with Jetty 9.4 on java 17 in the
 > >> production, containerized environment, it is indeed very stable.
 > >>
 > >> Running at less than 6% of memory on a 4GB machine compared to peaks of
 > >> ~50% for versions with Jetty 10. RES shows as 240K with 35K shared
 > >> (presume mostly libraries).
 > >>
 > >> Copy of trace is:
> >> https://www.dropbox.com/scl/fi/c58nqkr2hi193a84btedg/fuseki-4.9.0-jetty-9.4.png?rlkey=b7osnj6k1oy1xskl4j25zz6o8&dl=0
 > >>
 > >> The high spikes on the left of the image are the prior run with out of the
 > >> box 4.7.0 on the same JVM.
 > >>
 > >> The small spike at 06:00 is a dump so TDB was able to touch and scan all
 > >> the (modest) data with a very minor blip in resident size (as you'd hope).
 > >> JVM stats show the mapped buffers for TDB jumping up but confirm heap is
 > >> stable at < 60M, non-heap 60M.
 > >>
 > >> Dave
 > >>
 > >> On 10/07/2023 20:52, Dave Reynolds wrote:
> >>> Since this thread has got complex, I'm posting this update here at the
 > >>> top level.
 > >>>
 > >>> Thanks to folks, especially Andy and Rob for suggestions and for
 > >>> investigating.
 > >>>
 > >>> After a lot more testing at our end I believe we now have some
 > >>> workarounds.
 > >>>
 > >>> First, at least on java 17, the process growth does seem to level out.
 > >>> Despite what I just said to Rob, having just checked our soak tests, a
 > >>> jena 4.7.0/java 17 test with 500MB max heap has lasted for 7 days.
 > >>> Process size oscillates between 1.5GB and 2GB but hasn't gone above
 > >>> that in a week. The oscillation is almost entirely the cycling of the
 > >>> direct memory buffers used by Jetty. Empirically those cycle up to
 > >>> something comparable to the set max heap size, at least for us.
 > >>>
 > >>> While this week long test was 4.7.0, based on earlier tests I suspect
 > >>> 4.8.0 (and now 4.9.0) would also level out at least on a timescale of
 > >>> days.
 > >>>
 > >>> The key has been setting the max heap low. At 2GB and even 1GB (the
 > >>> default on a 4GB machine) we see higher peak levels of direct buffers
 > >>> and overall process size grew to around 3GB at which point the
> >>> container is killed on the small machines. Though java 17 does seem to
 > >>> be better behaved than java 11, so switching to that probably also
 > >>> helped.
 > >>>
 > >>> Given the actual heap usage is low (50MB heap, 60MB non-heap), needing
 > >>> 2GB to run in feels high but is workable. So my previously suggested
 > >>> rule of thumb of allowing 4x the max heap size, in this low memory
 > >>> regime, seems to work.
 > >>>
 > >>> Second, we're now pretty confident the issue is jetty 10+.
 > >>>
 > >>> We've built a fuseki-server 4.9.0 with Jetty replaced by version
 > >>> 9.4.51.v20230217. This required some minor source changes to compile
> >>> and pass tests. On a local bare metal test where we saw process growth
 > >>> up to 1.5-2GB this build has run stably using less than 500MB for 4
 > >>> hours.
 > >>>
 > >>> We'll set a longer term test running in the target containerized
 > >>> environment to confirm things, but we're quite hopeful this will be long term
 > >>> stable.
 > >>>
 > >>> I realise Jetty 9.4.x is out of community support but eclipse say EOL
 > >>> is "unlikely to happen before 2025". So, while this may not be a
 > >>> solution for the Jena project, it could give us a workaround at the
 > >>> cost of doing custom builds.
 > >>>
 > >>> Dave
 > >>>
 > >>>
 > >>> On 03/07/2023 14:20, Dave Reynolds wrote:
 > >>>> We have a very strange problem with recent fuseki versions when
 > >>>> running (in docker containers) on small machines. Suspect a jetty
 > >>>> issue but it's not clear.
 > >>>>
 > >>>> Wondering if anyone has seen anything like this.
 > >>>>
 > >>>> This is a production service but with tiny data (~250k triples, ~60MB
 > >>>> as NQuads). Runs on 4GB machines with java heap allocation of 500MB[1].
 > >>>>
 > >>>> We used to run using 3.16 on jdk 8 (AWS Corretto for the long term
 > >>>> support) with no problems.
 > >>>>
> >>>> Switching to fuseki 4.8.0 on jdk 11 the process grows in the space of
 > >>>> a day or so to reach ~3GB of memory at which point the 4GB machine
 > >>>> becomes unviable and things get OOM killed.
 > >>>>
 > >>>> The strange thing is that this growth happens when the system is
 > >>>> answering no Sparql queries at all, just regular health ping checks
 > >>>> and (prometheus) metrics scrapes from the monitoring systems.
 > >>>>
> >>>> Furthermore the space being consumed is not visible to any of the JVM
 > >>>> metrics:
 > >>>> - Heap and non-heap are stable at around 100MB total (mostly
 > >>>> non-heap metaspace).
 > >>>> - Mapped buffers stay at 50MB and remain long term stable.
 > >>>> - Direct memory buffers being allocated up to around 500MB then being
 > >>>> reclaimed. Since there are no sparql queries at all we assume this is
 > >>>> jetty NIO buffers being churned as a result of the metric scrapes.
 > >>>> However, this direct buffer behaviour seems stable: it cycles between
 > >>>> 0 and 500MB on approx a 10min cycle but is stable over a period of
 > >>>> days and shows no leaks.
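 > >>>>
 > >>>> (For reference, those buffer figures come straight from the JVM's buffer
 > >>>> pool MXBeans; an untested sketch for dumping them directly, in case
 > >>>> anyone wants to check without a full metrics stack:)
 > >>>>
 > >>>> import java.lang.management.BufferPoolMXBean;
 > >>>> import java.lang.management.ManagementFactory;
 > >>>>
 > >>>> public class BufferPools {
 > >>>>     public static void main(String[] args) {
 > >>>>         // Prints the "direct" and "mapped" pools referred to above.
 > >>>>         for (BufferPoolMXBean p :
 > >>>>                 ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
 > >>>>             System.out.printf("%s: count=%d used=%dKB capacity=%dKB%n",
 > >>>>                     p.getName(), p.getCount(),
 > >>>>                     p.getMemoryUsed() / 1024, p.getTotalCapacity() / 1024);
 > >>>>         }
 > >>>>     }
 > >>>> }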
 > >>>>
 > >>>> Yet the java process grows from an initial 100MB to at least 3GB.
 > >>>> This can occur in the space of a couple of hours or can take up to a
 > >>>> day or two with no predictability in how fast.
 > >>>>
 > >>>> Presumably there is some low level JNI space allocated by Jetty (?)
 > >>>> which is invisible to all the JVM metrics and is not being reliably
 > >>>> reclaimed.
 > >>>>
 > >>>> Trying 4.6.0, which we've had less problems with elsewhere, that
 > >>>> seems to grow to around 1GB (plus up to 0.5GB for the cycling direct
 > >>>> memory buffers) and then stays stable (at least on a three day soak
 > >>>> test). We could live with allocating 1.5GB to a system that should
 > >>>> only need a few 100MB but concerned that it may not be stable in the
 > >>>> really long term and, in any case, would rather be able to update to
 > >>>> more recent fuseki versions.
 > >>>>
 > >>>> Trying 4.8.0 on java 17 it grows rapidly to around 1GB again but then
 > >>>> keeps ticking up slowly at random intervals. We project that it would
 > >>>> take a few weeks to grow the scale it did under java 11 but it will
 > >>>> still eventually kill the machine.
 > >>>>
 > >>>> Anyone seen anything remotely like this?
 > >>>>
 > >>>> Dave
 > >>>>
 > >>>> [1] 500MB heap may be overkill but there can be some complex queries
> >>>> and that should still leave plenty of space for OS buffers etc in the
 > >>>> remaining memory on a 4GB machine.
 > >>>>
 > >>>>
 > >>>>
 > >
 >
