Re: Mystery memory leak in fuseki
You might try running `ltrace` to watch the library calls and system calls the JVM is making, e.g. `ltrace -S -f -p <pid>`. I think the `sbrk` call is used to allocate memory. It might be interesting to see if you can catch the JVM invoking that call and also see what is happening around it.

On Mon, Jul 3, 2023 at 10:50 AM Dave Reynolds wrote:
> On 03/07/2023 14:36, Martynas Jusevičius wrote:
> > There have been a few similar threads:
> >
> > https://www.mail-archive.com/users@jena.apache.org/msg19871.html
> > https://www.mail-archive.com/users@jena.apache.org/msg18825.html
>
> Thanks, I've seen those and not sure they quite match our case but maybe
> I'm mistaken.
>
> We already have a smallish heap allocation (500MB) which seems to be a
> key conclusion of both those threads. Though I guess we could try even
> lower.
>
> Furthermore the second thread was related to 3.16.0 which is completely
> stable for us at 150MB (rather than the 1.5GB that 4.6.* gets to, let
> alone the 3+GB that gets 4.8.0 killed).
>
> Dave
>
> On Mon, 3 Jul 2023 at 15:20, Dave Reynolds wrote:
> > [...]
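The `ltrace -S` suggestion above can also be approximated with `strace`, which attaches directly at the syscall level. A hedged sketch (the process name used for the pid lookup and the log file name are assumptions; note that `sbrk` is the libc wrapper, and the underlying Linux syscall is `brk`, with glibc malloc also obtaining large blocks via `mmap`):

```shell
# Sketch: build the strace invocation to attach to the running Fuseki JVM
# and log only memory-management syscalls. Run the printed command on the
# host (as root or the fuseki user). Pid 12345 is a placeholder; substitute
# e.g. $(pgrep -f fuseki).
FUSEKI_PID="${FUSEKI_PID:-12345}"
TRACE_CMD="strace -f -tt -e trace=brk,mmap,munmap -p $FUSEKI_PID -o fuseki-native.log"
echo "$TRACE_CMD"
```

Timestamps (`-tt`) make it possible to correlate any bursts of `brk`/`mmap` activity with the health-check and metrics-scrape schedule.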
Re: Mystery memory leak in fuseki
On 03/07/2023 14:36, Martynas Jusevičius wrote:
> There have been a few similar threads:
>
> https://www.mail-archive.com/users@jena.apache.org/msg19871.html
> https://www.mail-archive.com/users@jena.apache.org/msg18825.html

Thanks, I've seen those and I'm not sure they quite match our case, but maybe I'm mistaken.

We already have a smallish heap allocation (500MB), which seems to be a key conclusion of both those threads. Though I guess we could try even lower.

Furthermore the second thread was related to 3.16.0, which is completely stable for us at 150MB (rather than the 1.5GB that 4.6.* gets to, let alone the 3+GB that gets 4.8.0 killed).

Dave

On Mon, 3 Jul 2023 at 15:20, Dave Reynolds wrote:
> [...]
Re: Mystery memory leak in fuseki
On 03/07/2023 15:07, Andy Seaborne wrote:
> A possibility:
>
> https://www.nickebbitt.com/blog/2022/01/26/the-story-of-a-java-17-native-memory-leak/
>
> suggests workaround -XX:-UseStringDeduplication
>
> https://bugs.openjdk.org/browse/JDK-8277981
> https://github.com/openjdk/jdk/pull/6613
>
> which may be in Java 17.0.2

Ah, thanks, hadn't spotted that. Though I was testing with 17.0.7 and, as you say, they claim that was fixed in 17.0.2.

Dave
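For anyone wanting to try the workaround, a sketch of how the flags could be passed through (the `JVM_ARGS` environment variable is what the stock fuseki-server launch script reads in recent releases; treat the exact variable name as an assumption for your setup). Enabling native memory tracking at the same time lets `jcmd <pid> VM.native_memory summary` attribute whatever native growth remains:

```shell
# Sketch (JVM_ARGS assumed to be honoured by your fuseki-server launcher):
# keep the 500M heap, disable string deduplication per the linked JDK bug,
# and turn on native memory tracking for later jcmd inspection.
export JVM_ARGS="-Xmx500M -XX:-UseStringDeduplication -XX:NativeMemoryTracking=summary"
echo "$JVM_ARGS"
```

Note `-XX:NativeMemoryTracking` adds a small per-thread overhead, so it's a diagnostic setting rather than something to leave on in production indefinitely.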
Re: Mystery memory leak in fuseki
Hi Andy,

> Could you try 4.7.0?

Will do, though each test takes quite a while :)

> This is an in-memory database?

No, TDB1 - sorry, should have said that. Though as I say we are leaving the system to soak with absolutely no queries arriving, so it's not TDB churn, and it's RSS that's filling up.

FWIW 3.16.0 runs at 150MB with the same max heap setting, completely stable. So that's 10x smaller than 4.6.0 stabilizes at. If nothing else that confirms that the container set-up itself is not to blame.

> Micrometer/Prometheus has had several upgrades but if it is not heap and
> not direct memory (I thought that was a hard bound set at start up), I
> don't see how it can be involved.

Likewise.

Dave

On 03/07/2023 14:54, Andy Seaborne wrote:
> [...]
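Since it is process RSS rather than any JVM-visible pool that grows, one cheap check is to log RSS over time from `/proc` and correlate the jumps with the scrape and health-check schedule. A minimal Linux-only sketch (the pid lookup in the usage comment is an assumption):

```shell
# rss_kb reads a /proc/<pid>/status stream on stdin and prints the VmRSS
# (resident set size) figure in kB.
rss_kb() {
  awk '/^VmRSS:/ {print $2}'
}

# Live usage on the host would be something like:
#   FUSEKI_PID=$(pgrep -f fuseki)
#   while sleep 60; do
#     echo "$(date -u +%FT%TZ) $(rss_kb < /proc/$FUSEKI_PID/status)"
#   done >> rss.log
# Demo on a sample status line (prints 154112):
echo "VmRSS:    154112 kB" | rss_kb
```

A timestamped RSS log like this, plotted against the 10-minute direct-buffer cycle, would show whether the growth steps line up with the metric scrapes.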
Re: Mystery memory leak in fuseki
A possibility:

https://www.nickebbitt.com/blog/2022/01/26/the-story-of-a-java-17-native-memory-leak/

suggests workaround -XX:-UseStringDeduplication

https://bugs.openjdk.org/browse/JDK-8277981
https://github.com/openjdk/jdk/pull/6613

which may be in Java 17.0.2

Andy
Re: Mystery memory leak in fuseki
Hi Dave,

Could you try 4.7.0?

4.6.0 was 2022-08-20
4.7.0 was 2022-12-27
4.8.0 was 2023-04-20

This is an in-memory database?

Micrometer/Prometheus has had several upgrades, but if it is not heap and not direct memory (I thought that was a hard bound set at start-up), I don't see how it can be involved.

Andy

On 03/07/2023 14:20, Dave Reynolds wrote:
> [...]
Re: Mystery memory leak in fuseki
There have been a few similar threads:

https://www.mail-archive.com/users@jena.apache.org/msg19871.html
https://www.mail-archive.com/users@jena.apache.org/msg18825.html

On Mon, 3 Jul 2023 at 15:20, Dave Reynolds wrote:
> [...]
Mystery memory leak in fuseki
We have a very strange problem with recent Fuseki versions when running (in docker containers) on small machines. Suspect a jetty issue but it's not clear.

Wondering if anyone has seen anything like this.

This is a production service but with tiny data (~250k triples, ~60MB as NQuads). Runs on 4GB machines with a java heap allocation of 500MB [1].

We used to run using 3.16 on JDK 8 (AWS Corretto for the long term support) with no problems.

Switching to Fuseki 4.8.0 on JDK 11, the process grows in the space of a day or so to reach ~3GB of memory, at which point the 4GB machine becomes unviable and things get OOM killed.

The strange thing is that this growth happens when the system is answering no SPARQL queries at all, just regular health ping checks and (Prometheus) metrics scrapes from the monitoring systems.

Furthermore the space being consumed is not visible to any of the JVM metrics:
- Heap and non-heap are stable at around 100MB total (mostly non-heap metaspace).
- Mapped buffers stay at 50MB and remain long-term stable.
- Direct memory buffers are allocated up to around 500MB and then reclaimed. Since there are no SPARQL queries at all, we assume this is jetty NIO buffers being churned as a result of the metric scrapes. However, this direct buffer behaviour seems stable: it cycles between 0 and 500MB on approx a 10-min cycle, but is stable over a period of days and shows no leaks.

Yet the java process grows from an initial 100MB to at least 3GB. This can occur in the space of a couple of hours or can take up to a day or two, with no predictability in how fast.

Presumably there is some low-level JNI space allocated by Jetty (?) which is invisible to all the JVM metrics and is not being reliably reclaimed.

Trying 4.6.0, which we've had fewer problems with elsewhere, that seems to grow to around 1GB (plus up to 0.5GB for the cycling direct memory buffers) and then stays stable (at least on a three-day soak test). We could live with allocating 1.5GB to a system that should only need a few 100MB, but we're concerned that it may not be stable in the really long term and, in any case, would rather be able to update to more recent Fuseki versions.

Trying 4.8.0 on Java 17, it grows rapidly to around 1GB again but then keeps ticking up slowly at random intervals. We project that it would take a few weeks to grow to the scale it did under Java 11, but it will still eventually kill the machine.

Anyone seen anything remotely like this?

Dave

[1] A 500M heap may be overkill, but there can be some complex queries, and that should still leave plenty of space for OS buffers etc. in the remaining memory on a 4GB machine.
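The direct-buffer cycling described above can be watched from the same Prometheus scrapes that are suspected of churning the buffers. A hedged sketch: the metric name `jvm_buffer_memory_used_bytes` with an `id="direct"` label is the usual Micrometer JVM-binder naming, and `/$/metrics` is Fuseki's metrics endpoint, but both should be checked against your actual scrape output:

```shell
# Filter a Prometheus text-format scrape down to the direct-buffer gauge.
# The metric name and id label follow Micrometer's JVM buffer-pool binder
# (an assumption; verify against a real scrape).
direct_buffer_lines() {
  grep '^jvm_buffer_memory_used_bytes{id="direct"'
}

# Live usage would be something like:
#   curl -s http://localhost:3030/$/metrics | direct_buffer_lines
# Demo on a sample scrape line:
echo 'jvm_buffer_memory_used_bytes{id="direct",} 5.24288E8' | direct_buffer_lines
```

Logging this value alongside process RSS over a day or two would separate the benign 0-500MB buffer cycle from whatever native allocation is actually leaking.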
Any23 - moving to the attic.
The Any23 project is being retired.

https://attic.apache.org/projects/any23.html

Development and maintenance of Any23 is no longer happening, so the code is not being kept up to date with new versions of dependencies that have security issues. The Any23 PMC believes the best course of action is to retire the project and move Any23 into the Apache Attic. The ASF board has accepted this resolution.

The attic is not one-way - a project can reboot if three+ people want to contribute, form a new PMC, and bring it back out. With ASF members involved, it does not necessarily have to return via the incubator.

The Attic process means:
* the git repos become read-only - they can still be cloned and forked
* archived releases, including 2.7, remain accessible at https://archive.apache.org/dist/any23/
* all maven artifacts remain accessible
* the current release area is closed down
* mailing lists are closed down
* the automated build is turned off

Andy, on behalf of the Any23 PMC