Osma,

What are the long running queries?

It is possible that updates are unflushed (a bit surprising, given the length of time since the overnight updates). You can check this by looking at the journal file: if it is zero length, there are no outstanding commits waiting to be flushed.
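A rough sketch of that check, for illustration only. It assumes the journal is a file named journal.jrnl inside the TDB database directory; the function name is made up for this example.

```python
import os

# Assumption: the TDB journal lives in the database directory under this name.
JOURNAL_NAME = "journal.jrnl"


def has_unflushed_commits(db_dir, journal_name=JOURNAL_NAME):
    """Return True if the journal file exists and is not zero length."""
    path = os.path.join(db_dir, journal_name)
    try:
        return os.path.getsize(path) > 0
    except FileNotFoundError:
        # No journal file at all: nothing is waiting to be flushed.
        return False
```

Calling has_unflushed_commits() with the path to the database directory returns False when the journal is empty, which is the "no outstanding commits" case described above.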

But the size locked down by unflushed commits does not change due to read load.

Queries can use a lot of memory, and several in parallel could also cause an OOME.

How many datasets are on this server?

        Andy

PS Random other experience

On AWS, we have seen virtualization hardware "go bad" (I can't explain it any better). We have only seen it on old hardware (the m1 generation). A server, for no reason we can determine [*], simply starts having very high load and makes very slow progress, but is functionally fine. Because it is randomly going slow, queries build up, which can mean more memory in active use at any one point, so an OOME is possible. This is not a common occurrence.

[*] We allowed for cycle stealing from co-resident VMs, but the slowdown is on the order of 10x.

On 05/06/14 11:43, Rob Vesse wrote:
Osma

Comments inline:

On 05/06/2014 10:11, "Osma Suominen" <[email protected]> wrote:

Hi all!

On 30/05/14 16:36, Mark Feblowitz wrote:

After some amount of time I see a series of messages after update posts

        WARN  [xxxxxx] RC = 500 : Java heap space

And I’m seeing "java.lang.OutOfMemoryError: Java heap space" errors.

I also got this error yesterday on an important machine.

My setup is this: Fuseki 1.0.1 with a single TDB (no inference) that has
grown to 13GB and an additional jena-text index of 200MB. Fuseki is
given approx. 6GB of heap (-Xmx6000M) on a machine with 16GB RAM. The
machine is a virtual machine running 64bit CentOS 6, kernel
2.6.32-431.11.2.el6.x86_64, java -version gives this:
java version "1.7.0_51"
OpenJDK Runtime Environment (rhel-2.4.4.1.el6_5-x86_64 u51-b02)
OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)


At the time this happened there were no updates (we usually only update
during the night), just read-only SELECT and CONSTRUCT queries coming in
at around 8 queries per second on average.

As has been explained in this thread the problem is that the continuous
query load prevents the in-memory transaction journal from being fully
flushed to disk.  If the read only queries continue through the night
while you do updates then they will block the journal flush and you will
eventually hit this case.

JENA-703 (https://issues.apache.org/jira/browse/JENA-703) describes the
proposed fix for this issue but the side effect of that fix (as and when
it gets implemented) will be that for a system under continuous load reads
will be occasionally blocked and therefore some queries may experience
delays.


Suddenly queries stop working and CPU usage rises to around 350% (the
machine has 4 cores). Errors like this appear in the Fuseki log:

2014-06-04 11:42:52,293 WARN Fuseki               :: [14739636] RC = 500 : Java heap space
java.lang.OutOfMemoryError: Java heap space
2014-06-04 11:39:06,670 WARN Fuseki               :: [14739587] RC = 500 : GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
2014-06-04 11:43:54,817 INFO Fuseki               :: [14739587] 500 GC overhead limit exceeded (1,032.111 s)
2014-06-04 12:24:56,868 WARN Fuseki               :: [14739738] RC = 500 : GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
2014-06-04 12:24:04,660 WARN Fuseki               :: [14739862] RC = 500 : Java heap space
java.lang.OutOfMemoryError: Java heap space
2014-06-04 12:43:44,167 INFO Fuseki               :: [14739738] 500 GC overhead limit exceeded (2,227.717 s)
2014-06-04 12:43:44,167 INFO Fuseki               :: [14739862] 500 Java heap space (2,227.719 s)
2014-06-04 14:33:30,906 WARN Fuseki               :: [14740021] RC = 500 : GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
2014-06-04 14:34:21,722 INFO Fuseki               :: [14740021] 500 GC overhead limit exceeded (4,850.886 s)


The timestamps in the log entries are not always in order, as you can
see above. Sometimes it takes more than an hour for an individual query
to fail (see last entry).
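Because each INFO completion line carries both the finish time and the elapsed time, the start time of a failed request can be recovered from it. The following is only a sketch: it assumes each log entry sits on a single unwrapped line in the layout shown in this message, and the function and field names are invented for the example.

```python
import re
from datetime import datetime, timedelta

# Assumed layout: "<timestamp> INFO Fuseki :: [<id>] <status> <message> (<secs> s)"
COMPLETION = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO\s+Fuseki\s+:: "
    r"\[(?P<req>\d+)\] (?P<status>\d{3}) (?P<msg>.*) \((?P<secs>[\d,]+\.\d+) s\)$"
)


def parse_completion(line):
    """Parse one INFO completion line; return None if the line does not match."""
    m = COMPLETION.match(line.strip())
    if m is None:
        return None
    finished = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f")
    # Elapsed times use a comma as thousands separator, e.g. "4,850.886".
    elapsed = float(m["secs"].replace(",", ""))
    return {
        "request": int(m["req"]),
        "status": int(m["status"]),
        "message": m["msg"],
        "started": finished - timedelta(seconds=elapsed),
        "finished": finished,
        "elapsed_s": elapsed,
    }
```

Sorting the parsed entries by "started" rather than by log timestamp gives the order in which the failing requests actually arrived.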

Needless to say, this is a rather nasty way of failing - the process is
running, consuming nearly all CPU, but responding to queries very slowly
or not at all. It would be even better if the process just died, so
something else could restart it. I am considering using a tool such as
Monit to watch the Fuseki process and restart it if it starts behaving
oddly.
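The kind of probe such a watchdog would run could look roughly like this. It is a sketch, not Monit configuration: the function name is made up, the endpoint URL is supplied by the caller, and what to do on failure (for example restarting the service) is left out.

```python
import http.client
import urllib.error
import urllib.parse
import urllib.request


def fuseki_responds(query_endpoint, timeout_s=10.0):
    """Return True if a trivial SPARQL query is answered within timeout_s."""
    # "ASK {}" has an empty pattern, so it should be cheap to answer.
    url = query_endpoint + "?" + urllib.parse.urlencode({"query": "ASK {}"})
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Refused connections, timeouts and HTTP errors all count as unhealthy.
        return False
```

A supervisor could call this every minute against the dataset's query endpoint (hypothetically something like http://localhost:3030/ds/query) and restart the process only after several consecutive failures, so that one slow query does not trigger a restart.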

This is really a JVM issue and not something we can control.  The JVM
allows catching OOM errors (for better or worse) but in doing so it often
leaves applications in a state where they are extremely close to the heap
limit and so the user code hangs while the JVM furiously tries to GC
enough memory for it to continue.


Am I doing something obviously wrong here?

No, this is a known limitation of TDB's architecture.


Should I just give the JVM
even more memory, or adjust some of the other JVM options?

That only prolongs the time to failure. TDB relies heavily on
memory-mapped files, which are off heap, so increasing the heap size
hurts performance because it causes more swapping at the OS level.

Is there any
way to force the GC to fail faster, or otherwise avoid futile attempts
of freeing more memory?

You can try the solution detailed at
http://stackoverflow.com/a/3878199/107591

Add -XX:OnOutOfMemoryError="kill -9 %p"

That is an Oracle JVM option, so there is no guarantee it works on OpenJDK.

Rob

Even better, could Fuseki just stick to the
memory it is given?

Thanks,
Osma

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi




