There is one more variable in this picture: Java's container awareness:
https://developers.redhat.com/articles/2022/04/19/java-17-whats-new-openjdks-container-awareness
Whether it has an impact in this case, I have no idea :)
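For what it's worth, a quick way to see what the JVM believes it has been
given inside a container is just to ask it - a minimal sketch, plain JDK
only, nothing Fuseki-specific:

    import java.lang.management.ManagementFactory;

    public class ContainerLimits {
        public static void main(String[] args) {
            // With container awareness these reflect the cgroup limits,
            // not the host's physical resources.
            Runtime rt = Runtime.getRuntime();
            System.out.println("Available processors: " + rt.availableProcessors());
            System.out.println("Max heap (bytes):     " + rt.maxMemory());
            // Non-heap (metaspace etc.) usage for comparison
            System.out.println("Non-heap used:        "
                    + ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage().getUsed());
        }
    }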
On Thu, 20 Jul 2023 at 11:48, Conal McLaughlin <conal.mclaugh...@iotics.com> wrote:
> Hey Andy,
>
> Metaspace seems to be stable!
>
> We're running this on Java 11 currently.
> I can check it out with Java 17 though.
>
> We've currently set Xms/Xmx to 2560MB & MaxMetaspaceSize to 256MB.
> The ECS task is set with a ceiling of 4GB memory & 1 vCPU.
>
> Could it be more of a race condition than the size of used objects, due to logging?
> I do see some time-sensitive eviction code in Jetty -
> https://github.com/eclipse/jetty.project/blob/9e16d81cf8922c75e3d2d96c66442b896a9c69e1/jetty-io/src/main/java/org/eclipse/jetty/io/ArrayRetainableByteBufferPool.java#L374
> Not sure if the same type of thing exists in the Jena codebase.
>
> I will try to check with `--empty` also.
>
> Thanks,
> Conal
>
> On 2023/07/19 21:10:24 Andy Seaborne wrote:
> > Conal,
> >
> > Thanks for the information.
> > Can you see if metaspace is growing as well?
> >
> > All,
> >
> > Could someone please try running Fuseki main, with no datasets (--empty), with some healthcheck ping traffic.
> >
> > Andy
> >
> > On 19/07/2023 14:42, Conal McLaughlin wrote:
> > > Hey Dave,
> > >
> > > Thank you for providing an in-depth analysis of your issues.
> > > We appear to be witnessing the same type of problems with our current Fuseki deployment.
> > > We are deploying a containerised Fuseki into an AWS ECS task alongside other containers - this may not be ideal but that's a different story.
> > >
> > > I just wanted to add another data point to everything you have described.
> > > Firstly, it does seem like "idle" (or very low traffic) instances are the problem for us (coupled with a larger heap than necessary).
> > > We witness the same increase in the ECS task memory consumption up until the whole thing, including the Fuseki container, is killed off.
> > >
> > > In an attempt to see what was going on beneath the hood, we turned up the logging to TRACE in the log4j2.xml file provided to Fuseki.
> > > This appeared to stabilise the increasing memory consumption.
> > > Even just switching `logger.jetty.level` to TRACE alleviates the issue.
> >
> > Colour me confused!
> >
> > A Log4j logger that is active will use a few objects - maybe that's enough to trigger a minor GC, which in turn is enough to flush some non-heap resources.
> >
> > How big is the heap?
> > This is Java 17?
> >
> > > We are testing this on Fuseki 4.8.0/TDB2 with close to 0 triples and extremely low query traffic / health checks via /ping.
> > > [ecs-task-memory graph: https://pasteboard.co/KPk7uhH2F9Lp.png]
> > >
> > > Cheers,
> > > Conal
> > >
> > > On 2023/07/11 09:31:25 Dave Reynolds wrote:
> > > > Hi Rob,
> > > >
> > > > Good point. Will try to find time to experiment with that, but given the testing cycle time that will take a while and can't start immediately.
> > > >
> > > > I'm a little sceptical though. As mentioned before, all the metrics we see show the direct memory pool that Jetty uses cycling up to the max heap size and then being collected, but with no long-term growth to match the process size growth. This really feels more like a bug (though not sure where) than tuning. The fact that actual behaviour doesn't match the documentation isn't encouraging.
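[Aside: the direct/mapped figures being quoted here come from the JDK's
standard BufferPoolMXBeans (I believe the Prometheus scrape reports the
same beans). A minimal sketch for watching the cycling outside of the
metrics stack - class and interval are illustrative only:]

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;

    public class BufferPoolWatch {
        public static void main(String[] args) throws InterruptedException {
            // Samples the "direct" and "mapped" buffer pools once a minute.
            while (true) {
                for (BufferPoolMXBean pool :
                        ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
                    System.out.printf("%s: count=%d used=%d capacity=%d%n",
                            pool.getName(), pool.getCount(),
                            pool.getMemoryUsed(), pool.getTotalCapacity());
                }
                Thread.sleep(60_000);
            }
        }
    }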
> > > > It's also pretty hard to figure out what the right pool configuration would be. This thing is just being asked to deliver a few metrics (12KB per request) several times a minute but manages to eat 500MB of direct buffer space every 5 mins. So what the right pool parameters are to support real usage peaks is not going to be easy to figure out.
> > > >
> > > > Nonetheless you are right. That's something that should be explored.
> > > >
> > > > Dave
> > > >
> > > > On 11/07/2023 09:45, Rob @ DNR wrote:
> > > > > Dave
> > > > >
> > > > > Thanks for the further information.
> > > > >
> > > > > Have you experimented with using Jetty 10 but providing more detailed configuration? Fuseki supports providing detailed Jetty configuration if needed via the --jetty-config option.
> > > > >
> > > > > The following section looks relevant:
> > > > >
> > > > > https://eclipse.dev/jetty/documentation/jetty-10/operations-guide/index.html#og-module-bytebufferpool
> > > > >
> > > > > It looks like the default is that Jetty uses a heuristic to determine these values; sadly the heuristic in question is not detailed in that documentation.
> > > > >
> > > > > Best guess from digging through their code is that the "heuristic" is this:
> > > > >
> > > > > https://github.com/eclipse/jetty.project/blob/jetty-10.0.x/jetty-io/src/main/java/org/eclipse/jetty/io/AbstractByteBufferPool.java#L78-L84
> > > > >
> > > > > i.e., ¼ of the configured max heap size. This doesn't necessarily align with the exact sizes of process growth you see, but I note the documentation does explicitly say that buffers used can go beyond these limits - those will just be GC'd rather than pooled for reuse.
> > > > >
> > > > > Example byte buffer configuration at:
> > > > > https://github.com/eclipse/jetty.project/blob/9a05c75ad28ebad4abbe624fa432664c59763747/jetty-server/src/main/config/etc/jetty-bytebufferpool.xml#L4
> > > > >
> > > > > Any chance you could try customising this for your needs with stock Fuseki and see if this allows you to make the process size smaller and sufficiently predictable for your use case?
> > > > >
> > > > > Rob
> > > > >
> > > > > From: Dave Reynolds <da...@gmail.com>
> > > > > Date: Tuesday, 11 July 2023 at 08:58
> > > > > To: users@jena.apache.org <us...@jena.apache.org>
> > > > > Subject: Re: Mystery memory leak in fuseki
> > > > > For interest[*] ...
> > > > >
> > > > > This is what the core JVM metrics look like when transitioning from a Jetty 10 to a Jetty 9.4 instance. You can see the direct buffers cycling up to 500MB (which happens to be the max heap setting) on Jetty 10, nothing on Jetty 9. The drop in mapped buffers is just because TDB hadn't been asked any queries yet.
> > > > >
> > > > > https://www.dropbox.com/scl/fi/9afhrztbb36fvzqkuw996/fuseki-jetty10-jetty9-transition.png?rlkey=7fpj4x1pn5mjnf3jjwenmp65m&dl=0
> > > > >
> > > > > Here are the same metrics around the time of triggering a TDB backup. Shows the mapped buffer use for TDB but no significant impact on heap etc.
> > > > >
> > > > > https://www.dropbox.com/scl/fi/0s40vpizf94c4w3m2awna/fuseki-jetty10-backup.png?rlkey=ai31m6z58w0uex8zix8e9ctna&dl=0
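[Aside: for anyone wanting to reproduce the backup test, the backup can be
triggered through the Fuseki admin protocol - a sketch, assuming the admin
endpoints are enabled and a dataset named /ds (the name here is made up):]

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class TriggerBackup {
        public static void main(String[] args) throws Exception {
            // POST /$/backup/{name} starts an async backup of dataset {name};
            // the JSON response carries a task id that can be polled via /$/tasks.
            HttpRequest req = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:3030/$/backup/ds"))
                    .POST(HttpRequest.BodyPublishers.noBody())
                    .build();
            HttpResponse<String> resp = HttpClient.newHttpClient()
                    .send(req, HttpResponse.BodyHandlers.ofString());
            System.out.println(resp.statusCode() + " " + resp.body());
        }
    }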
> > > > > These are all on the same instance as the RES memory trace:
> > > > >
> > > > > https://www.dropbox.com/scl/fi/c58nqkr2hi193a84btedg/fuseki-4.9.0-jetty-9.4.png?rlkey=b7osnj6k1oy1xskl4j25zz6o8&dl=0
> > > > >
> > > > > Dave
> > > > >
> > > > > [*] I've been staring at metric graphs for so many days I may have a distorted notion of what's interesting :)
> > > > >
> > > > > On 11/07/2023 08:39, Dave Reynolds wrote:
> > > > >> After a 10 hour test of 4.9.0 with Jetty 9.4 on java 17 in the production, containerized, environment, it is indeed very stable.
> > > > >>
> > > > >> Running at less than 6% of memory on a 4GB machine, compared to peaks of ~50% for versions with Jetty 10. RES shows as 240K with 35K shared (presume mostly libraries).
> > > > >>
> > > > >> Copy of trace is:
> > > > >> https://www.dropbox.com/scl/fi/c58nqkr2hi193a84btedg/fuseki-4.9.0-jetty-9.4.png?rlkey=b7osnj6k1oy1xskl4j25zz6o8&dl=0
> > > > >>
> > > > >> The high spikes on the left of the image are the prior run with out-of-the-box 4.7.0 on the same JVM.
> > > > >>
> > > > >> The small spike at 06:00 is a dump, so TDB was able to touch and scan all the (modest) data with a very minor blip in resident size (as you'd hope). JVM stats show the mapped buffers for TDB jumping up but confirm heap is stable at < 60M, non-heap 60M.
> > > > >>
> > > > >> Dave
> > > > >>
> > > > >> On 10/07/2023 20:52, Dave Reynolds wrote:
> > > > >>> Since this thread has got complex, I'm posting this update here at the top level.
> > > > >>>
> > > > >>> Thanks to folks, especially Andy and Rob, for suggestions and for investigating.
> > > > >>>
> > > > >>> After a lot more testing at our end I believe we now have some workarounds.
> > > > >>>
> > > > >>> First, at least on java 17, the process growth does seem to level out. Despite what I just said to Rob, having just checked our soak tests, a jena 4.7.0/java 17 test with 500MB max heap has lasted for 7 days. Process size oscillates between 1.5GB and 2GB but hasn't gone above that in a week. The oscillation is almost entirely the cycling of the direct memory buffers used by Jetty. Empirically those cycle up to something comparable to the set max heap size, at least for us.
> > > > >>>
> > > > >>> While this week-long test was 4.7.0, based on earlier tests I suspect 4.8.0 (and now 4.9.0) would also level out, at least on a timescale of days.
> > > > >>>
> > > > >>> The key has been setting the max heap low. At 2GB and even 1GB (the default on a 4GB machine) we see higher peak levels of direct buffers, and overall process size grew to around 3GB, at which point the container is killed on the small machines. Though java 17 does seem to be better behaved than java 11, so switching to that probably also helped.
> > > > >>>
> > > > >>> Given the actual heap is low (50MB heap, 60MB non-heap), needing 2GB to run in feels high but is workable. So my previously suggested rule of thumb (in this low memory regime, allow 4x the max heap size) seems to work.
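[Aside: if I remember rightly, when -XX:MaxDirectMemorySize is not set the
JDK caps direct buffer allocation at roughly the max heap size, which would
fit the observation that the buffers cycle up to ~Xmx - treat that as an
assumption to verify rather than fact. A sketch for checking the effective
setting:]

    import java.lang.management.ManagementFactory;
    import com.sun.management.HotSpotDiagnosticMXBean;

    public class DirectMemoryCap {
        public static void main(String[] args) {
            HotSpotDiagnosticMXBean hotspot =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            // "0" means unset, in which case the runtime falls back to
            // (approximately) the max heap size as the direct buffer cap.
            System.out.println("MaxDirectMemorySize = "
                    + hotspot.getVMOption("MaxDirectMemorySize").getValue());
            System.out.println("Max heap            = "
                    + Runtime.getRuntime().maxMemory());
        }
    }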
> > > > >>> Second, we're now pretty confident the issue is jetty 10+.
> > > > >>>
> > > > >>> We've built a fuseki-server 4.9.0 with Jetty replaced by version 9.4.51.v20230217. This required some minor source changes to compile and pass tests. On a local bare metal test where we saw process growth up to 1.5-2GB, this build has run stably using less than 500MB for 4 hours.
> > > > >>>
> > > > >>> We'll set a longer term test running in the target containerized environment to confirm things, but we're quite hopeful this will be long term stable.
> > > > >>>
> > > > >>> I realise Jetty 9.4.x is out of community support but eclipse say EOL is "unlikely to happen before 2025". So, while this may not be a solution for the Jena project, it could give us a workaround at the cost of doing custom builds.
> > > > >>>
> > > > >>> Dave
> > > > >>>
> > > > >>> On 03/07/2023 14:20, Dave Reynolds wrote:
> > > > >>>> We have a very strange problem with recent fuseki versions when running (in docker containers) on small machines. Suspect a jetty issue but it's not clear.
> > > > >>>>
> > > > >>>> Wondering if anyone has seen anything like this.
> > > > >>>>
> > > > >>>> This is a production service but with tiny data (~250k triples, ~60MB as NQuads). Runs on 4GB machines with a java heap allocation of 500MB[1].
> > > > >>>>
> > > > >>>> We used to run using 3.16 on jdk 8 (AWS Corretto for the long term support) with no problems.
> > > > >>>>
> > > > >>>> Switching to fuseki 4.8.0 on jdk 11, the process grows in the space of a day or so to reach ~3GB of memory, at which point the 4GB machine becomes unviable and things get OOM killed.
> > > > >>>>
> > > > >>>> The strange thing is that this growth happens when the system is answering no Sparql queries at all, just regular health ping checks and (prometheus) metrics scrapes from the monitoring systems.
> > > > >>>>
> > > > >>>> Furthermore, the space being consumed is not visible to any of the JVM metrics:
> > > > >>>> - Heap and non-heap are stable at around 100MB total (mostly non-heap metaspace).
> > > > >>>> - Mapped buffers stay at 50MB and remain long term stable.
> > > > >>>> - Direct memory buffers are allocated up to around 500MB then reclaimed. Since there are no sparql queries at all, we assume this is jetty NIO buffers being churned as a result of the metric scrapes. However, this direct buffer behaviour seems stable: it cycles between 0 and 500MB on approx a 10 min cycle, but is stable over a period of days and shows no leaks.
> > > > >>>>
> > > > >>>> Yet the java process grows from an initial 100MB to at least 3GB. This can occur in the space of a couple of hours or can take up to a day or two, with no predictability in how fast.
> > > > >>>>
> > > > >>>> Presumably there is some low level JNI space allocated by Jetty (?) which is invisible to all the JVM metrics and is not being reliably reclaimed.
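[Aside: the usual way to chase native allocations the JVM metrics can't see
is -XX:NativeMemoryTracking=summary plus "jcmd <pid> VM.native_memory summary",
though NMT won't catch allocations made outside the JVM's own allocator either.
A sketch that totals everything the JVM will admit to, for comparison against
the RES figure:]

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;

    public class JvmVisibleMemory {
        public static void main(String[] args) {
            MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
            long total = mem.getHeapMemoryUsage().getCommitted()
                       + mem.getNonHeapMemoryUsage().getCommitted();
            for (BufferPoolMXBean pool :
                    ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
                total += pool.getMemoryUsed(); // the "direct" and "mapped" pools
            }
            // Whatever RES shows beyond (roughly) this total is native memory
            // that none of the JVM metrics account for.
            System.out.println("JVM-visible (bytes): " + total);
        }
    }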
> > > > >>>> Trying 4.6.0, which we've had fewer problems with elsewhere: that seems to grow to around 1GB (plus up to 0.5GB for the cycling direct memory buffers) and then stays stable (at least on a three day soak test). We could live with allocating 1.5GB to a system that should only need a few 100MB, but we're concerned that it may not be stable in the really long term and, in any case, would rather be able to update to more recent fuseki versions.
> > > > >>>>
> > > > >>>> Trying 4.8.0 on java 17, it grows rapidly to around 1GB again but then keeps ticking up slowly at random intervals. We project that it would take a few weeks to grow to the scale it did under java 11, but it will still eventually kill the machine.
> > > > >>>>
> > > > >>>> Anyone seen anything remotely like this?
> > > > >>>>
> > > > >>>> Dave
> > > > >>>>
> > > > >>>> [1] 500M heap may be overkill but there can be some complex queries, and that should still leave plenty of space for OS buffers etc. in the remaining memory on a 4GB machine.