Dan Pritts <da...@umich.edu>
June 15, 2018 at 11:38 AM
Will do, thanks.
Dan Pritts <da...@umich.edu>
June 14, 2018 at 10:48 PM
So the issue is that memory goes up - that is, the heap expands to the
maximum Xmx size set? The JVM does not return any heap back to the
OS (as far as I know), so if all the applications grow their heaps,
real RAM must grow to match, or swapping may result.
Hi Andy,
thanks for taking the time to help.
The problem is that the NON-HEAP memory usage skyrockets.
I "allocate" memory for the heap. The gc logs suggested that I was
never exceeding 6GB of heap in use, even when things went to hell. So
I set the heap to 10GB.
Now that I know we're using NIO, I "allocate" memory for NIO to hold
the entire index in ram. the db is 2.4GB on disk. I don't know NIO
well but this seems plausible.
let's throw another gig at java for its own internal use.
That would add up to 10 + 2.4 + 1 = 13.4GB of memory I might expect
Java to use. There's nothing else on the server except Apache, Linux,
and a few system daemons (Postfix, etc.).
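Aside: the JVM can report its NIO usage directly - there are two
standard buffer-pool MXBeans, "direct" and "mapped". A minimal sketch
(the class name is mine; to inspect a live Fuseki you'd read the same
beans over JMX):

  import java.lang.management.BufferPoolMXBean;
  import java.lang.management.ManagementFactory;
  import java.util.List;

  public class NioBufferReport {
      public static void main(String[] args) {
          // "direct" covers ByteBuffer.allocateDirect; "mapped" covers
          // FileChannel.map, which is where memory-mapped index files
          // show up.
          List<BufferPoolMXBean> pools =
              ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
          for (BufferPoolMXBean pool : pools) {
              System.out.printf("%s: count=%d used=%d capacity=%d%n",
                      pool.getName(), pool.getCount(),
                      pool.getMemoryUsed(), pool.getTotalCapacity());
          }
      }
  }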
I upgraded to 3.7 and put Fuseki on its own AWS instance last night.
RAM was 16GB and swap 10GB.
Once today it filled RAM & swap such that Linux whacked the JVM
process. Two other times today it was swapping heavily (5GB of swap
used) and we restarted Fuseki before the system ran out of swap.
For some reason, the JVM running Fuseki+Jetty is going nuts with its
memory usage. It *is* using more heap than usual when this happens,
but it's not using more than the 10GB I allocated. At least, not
according to the garbage collection logs.
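To cross-check the GC logs from outside the process, plain jstat
against the live JVM shows heap occupancy as a percentage of the
committed sizes (a sketch; the pgrep pattern is illustrative):

  # Survivor/eden/old-gen utilization, sampled every 5 seconds.
  jstat -gcutil $(pgrep -f fuseki-server) 5s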
We have had this problem a few times in the past - memory usage would
spike drastically. We'd always attributed it to a slow memory leak,
and decided we should restart Fuseki regularly. But in the last
couple of weeks it's happened probably a dozen times.
After the third time today, I put it on a 32GB instance. Of course,
the problem hasn't happened since.
A couple of possibilities:
1/ A query does an ORDER BY that involves a large set of results to
sort. This then drives up the heap requirement; the JVM grows the
heap and now the process is larger. There may well be a CPU spike at
this time.
2/ Updates are building up. The journal isn't flushed to the main
database until there is a quiet moment and with the high query rate
you may get bursts of time when it is not quiet. The updates are
safe in the journal (the commit happened) but also in-memory as an
overlay on the database. The overlays are collapsed when there are
no readers or writers.
What might be happening is that there isn't a quiet moment.
The traffic is certainly steady - it was about 1500 hits/minute today
when we first crashed.
A big sudden jump would imply a big update as well.
Try setting the log to INFO (and, yes, at load it does get big).
What you are looking for is overlaps of queries and updates, so that
the log shows truly concurrent execution (i.e. [1] starts, [2] starts,
[1] finishes logged after [2] starts) around the time the size grows
quickly; also check the size of the updates.
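A minimal log4j.properties change for that might be the following -
the logger name matches the stock Fuseki 3.x config, but double-check
against yours:

  # Turn the request log back up so overlapping [1]/[2] execution shows.
  log4j.logger.org.apache.jena.fuseki.Fuseki=INFO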
I will look for this. I am dubious, though. We don't make many
writes, and those we do are not very big. Our dataset is metadata
about our archive. The archive is 50 years old, and grows steadily
but slowly.
We had disabled the Fuseki log but left httpd logging enabled, because
each was huge. Unfortunately the updates were all in POSTs, which I
hadn't noticed until I went looking just now. So I will have to wait
until next time.
thanks
danno
Dan Pritts <da...@umich.edu>
June 12, 2018 at 10:49 PM
We had the problem again today.
Load was higher than average, but again not lunacy - about 3k hits per
minute. There is no immediately obvious bad query, although I hardly
know what to look for in the SPARQL - I just looked for extra-long
statements. Nothing in the fuseki.log at all within an hour of the
event. As you know the logs are verbose, so we have logging set to
"WARN" for just about everything. I'll append the log4j.properties to
the end of this message - if there's something in particular that'd
be useful to turn up, let me know.
I upgraded our dev & test to 3.7.0 today, and am doing production
tonight. I also recreated the database from a backup, and am looking
to verify that all DB changes made since the 3.6 upgrade made it into
Fuseki.
For background, could you share a directory listing with file sizes?
total 2.8G
-rw-r--r--. 1 fuseki fuseki 8.0M Feb 18 11:30 GOSP.dat
-rw-r--r--. 1 fuseki fuseki 8.0M Feb 18 11:30 GOSP.idn
-rw-r--r--. 1 fuseki fuseki 8.0M Feb 18 11:30 GPOS.dat
-rw-r--r--. 1 fuseki fuseki 8.0M Feb 18 11:30 GPOS.idn
-rw-r--r--. 1 fuseki fuseki 8.0M Feb 18 11:30 GSPO.dat
-rw-r--r--. 1 fuseki fuseki 8.0M Feb 18 11:30 GSPO.idn
-rw-rw-r--. 1 fuseki fuseki 0 Jun 12 15:27 journal.jrnl
-rw-r--r--. 1 fuseki fuseki 208M Jun 12 15:27 node2id.dat
-rw-r--r--. 1 fuseki fuseki 32M Jun 12 10:55 node2id.idn
-rw-r--r--. 1 fuseki fuseki 545M Jun 12 15:27 nodes.dat
-rw-r--r--. 1 fuseki fuseki 784M Jun 12 15:27 OSP.dat
-rw-r--r--. 1 fuseki fuseki 8.0M Feb 18 11:30 OSPG.dat
-rw-r--r--. 1 fuseki fuseki 8.0M Feb 18 11:30 OSPG.idn
-rw-r--r--. 1 fuseki fuseki 88M Jun 12 11:09 OSP.idn
-rw-r--r--. 1 fuseki fuseki 760M Jun 12 15:27 POS.dat
-rw-r--r--. 1 fuseki fuseki 8.0M Feb 18 11:30 POSG.dat
-rw-r--r--. 1 fuseki fuseki 8.0M Feb 18 11:30 POSG.idn
-rw-r--r--. 1 fuseki fuseki 88M Jun 12 11:09 POS.idn
-rw-r--r--. 1 fuseki fuseki 8.0M Feb 18 11:30 prefix2id.dat
-rw-r--r--. 1 fuseki fuseki 8.0M Feb 18 11:30 prefix2id.idn
-rw-r--r--. 1 fuseki fuseki 0 Feb 18 11:30 prefixes.dat
-rw-r--r--. 1 fuseki fuseki 8.0M Feb 18 11:30 prefixIdx.dat
-rw-r--r--. 1 fuseki fuseki 8.0M Feb 18 11:30 prefixIdx.idn
-rw-r--r--. 1 fuseki fuseki 808M Jun 12 15:27 SPO.dat
-rw-r--r--. 1 fuseki fuseki 8.0M Feb 18 11:30 SPOG.dat
-rw-r--r--. 1 fuseki fuseki 8.0M Feb 18 11:30 SPOG.idn
-rw-r--r--. 1 fuseki fuseki 96M Jun 12 11:09 SPO.idn
-rw-r--r--. 1 fuseki fuseki 20K Feb 18 11:33 stats.opt
-rw-rw-r--. 1 fuseki fuseki 5 Jun 12 12:30 tdb.lock
When you restart - it looks like that 10G is the mapped file space
being dropped. Mapping is on-demand in chunks, so on restart it is
very small and grows over time. It should reach a steady state. It
should not cause swapping or GC.
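That mapped file space is visible from outside the JVM with plain
pmap (a sketch; the pgrep pattern is illustrative):

  # Largest resident mappings first; TDB's .dat/.idn files appear by path.
  pmap -x $(pgrep -f fuseki-server) | sort -k3 -n -r | head -25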
Yes, I noticed that the server's VSZ is actually larger than the
virtual memory (swap + RAM) it is using; I figured it was something
along those lines. But when I referred to memory + swap used, I meant
the actual RSS as reported by ps, plus the inferred swap usage (swap
before vs. after a Fuseki restart).
I was running "ps" & "free" every couple of minutes. As you can see,
between 12:24 & 12:26 Fuseki's memory usage skyrockets.
I've mildly edited the output below, but the numbers are all unmolested.
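For reference, the sampling loop was essentially this (a sketch; the
pgrep pattern is illustrative):

  while true; do
      date
      ps u -p $(pgrep -f fuseki-server)   # USER PID %CPU %MEM VSZ RSS ...
      free
      sleep 120
  done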
Tue Jun 12 12:18:01 EDT 2018
USER PID %CPU %MEM VSZ RSS TTY STAT
START TIME COMMAND
fuseki 32175 23.1 65.3 41186832 21496864 ? Sl Jun11
04:41:46 /etc/alternatives/java_sdk_1.8.0/bin/java -Xmx16G
-Dlog4j.configuration=file:/etc/archonnex/fuseki/log4j.properties [ gc
logging options here ] -jar
/usr/local/apache-jena-fuseki-3.6.0/fuseki-server.jar
--config=/etc/archonnex/fuseki/fcrepo.ttl
total used free shared buffers cached
Mem: 32877320 31168460 1708860 416 3184 961276
-/+ buffers/cache: 30204000 2673320
Swap: 27257848 3145708 24112140
[...]
Tue Jun 12 12:22:01 EDT 2018
fuseki 32175 23.1 66.5 41383440 21870824 ? Sl Jun11
04:43:32 /etc/alternatives/java_sdk_1.8.0/bin/java -Xmx16G
Mem: 32877320 31314128 1563192 488 1880 720456
-/+ buffers/cache: 30591792 2285528
Swap: 27257848 3145256 24112592
Tue Jun 12 12:24:01 EDT 2018
fuseki 32175 23.2 64.9 40859152 21352808 ? Sl Jun11
04:44:19 /etc/alternatives/java_sdk_1.8.0/bin/java -Xmx16G -
Mem: 32877320 31276020 1601300 504 2104 1231452
-/+ buffers/cache: 30042464 2834856
Swap: 27257848 3094076 24163772
Tue Jun 12 12:26:02 EDT 2018
fuseki 32175 23.3 82.6 49183252 27179308 ? Sl Jun11
04:46:21 /etc/alternatives/java_sdk_1.8.0/bin/java -Xmx16G
Mem: 32877320 32655256 222064 476 1516 25612
-/+ buffers/cache: 32628128 249192
Swap: 27257848 8361760 18896088
Tue Jun 12 12:28:01 EDT 2018
fuseki 32175 23.5 71.6 48702204 23540952 ? Sl Jun11
04:49:44 /etc/alternatives/java_sdk_1.8.0/bin/java -Xmx16G
Mem: 32877320 30432416 2444904 484 2132 239924
-/+ buffers/cache: 30190360 2686960
Swap: 27257848 10598088 16659760
Java monitoring of the heap size should show the heap in use after a
major GC to be a different, smaller size.
Yesterday I fixed the garbage collection logging. I looked at it with
gceasy.io; there is nothing horribly wrong there. Heap doesn't go
above 7GB, even when things went to hell. Heap usage did increase
significantly at the time of the problems - note the repeated Full GCs.
2018-06-12T12:22:28.779-0400: 73381.748: [GC (System.gc())
Desired survivor size 134742016 bytes, new threshold 9 (max 15)
[PSYoungGen: 3068955K->7622K(5460480K)] 3087627K->26295K(6654976K),
0.0068808 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
2018-06-12T12:22:28.786-0400: 73381.755: [Full GC (System.gc())
[PSYoungGen: 7622K->0K(5460480K)] [ParOldGen:
18672K->24964K(1194496K)] 26295K->24964K(6654976K), [Metaspace:
34037K->34037K(1081344K)], 0.1054190 secs] [Times: user=0.57 sys=0.00,
real=0.10 secs]
2018-06-12T12:23:22.592-0400: 73435.562: [GC (System.gc())
Desired survivor size 130547712 bytes, new threshold 8 (max 15)
[PSYoungGen: 2440898K->2816K(5455872K)] 2465863K->27780K(6650368K),
0.0037102 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
2018-06-12T12:23:22.596-0400: 73435.566: [Full GC (System.gc())
[PSYoungGen: 2816K->0K(5455872K)] [ParOldGen:
24964K->27048K(1194496K)] 27780K->27048K(6650368K), [Metaspace:
34037K->34037K(1081344K)], 0.1114969 secs] [Times: user=0.61 sys=0.00,
real=0.11 secs]
2018-06-12T12:24:02.404-0400: 73475.374: [GC (Allocation Failure)
Desired survivor size 175112192 bytes, new threshold 7 (max 15)
[PSYoungGen: 5324288K->127456K(5377536K)] 5351336K->201416K(6572032K),
0.1020528 secs] [Times: user=0.66 sys=0.00, real=0.10 secs]
2018-06-12T12:24:29.348-0400: 73502.318: [GC (System.gc())
Desired survivor size 193986560 bytes, new threshold 6 (max 15)
[PSYoungGen: 880066K->129888K(5380096K)] 954027K->203848K(6574592K),
0.0642832 secs] [Times: user=0.33 sys=0.00, real=0.06 secs]
2018-06-12T12:24:29.412-0400: 73502.382: [Full GC (System.gc())
[PSYoungGen: 129888K->0K(5380096K)] [ParOldGen:
73960K->196536K(1194496K)] 203848K->196536K(6574592K), [Metaspace:
34037K->34037K(1081344K)], 0.3551479 secs] [Times: user=1.78 sys=0.00,
real=0.35 secs]
2018-06-12T12:27:48.073-0400: 73701.045: [GC (System.gc())
Desired survivor size 186646528 bytes, new threshold 5 (max 15)
[PSYoungGen: 2862549K->16720K(5409792K)] 3059085K->213256K(6604288K),
2.1344761 secs] [Times: user=1.07 sys=0.09, real=2.13 secs]
2018-06-12T12:27:50.210-0400: 73703.179: [Full GC (System.gc())
[PSYoungGen: 16720K->0K(5409792K)] [ParOldGen:
196536K->206591K(1194496K)] 213256K->206591K(6604288K), [Metaspace:
34037K->34037K(1081344K)], 2.9111523 secs] [Times: user=2.51 sys=0.09,
real=2.91 secs]
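Side note: those Full GCs are all tagged (System.gc()), i.e. explicit
collections. One known trigger on JDK 8 is the NIO direct-buffer
allocator (java.nio.Bits) calling System.gc() when it cannot reserve
more direct memory, which would be consistent with an off-heap
problem. For reference, the JDK 8 flags that produce logs in this
format are along these lines (the log path is illustrative):

  -Xloggc:/var/log/fuseki/gc.log
  -XX:+PrintGCDetails
  -XX:+PrintGCDateStamps
  -XX:+PrintTenuringDistribution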
If that is not how it is, there is something to investigate.
Andy
Dan Pritts <da...@umich.edu>
June 11, 2018 at 5:28 PM
Hi all,
We've been having trouble with our production Fuseki instance. A few
specifics:
Fuseki 3.6.0, standalone/Jetty. OpenJDK 1.8.0.171 on RHEL6. On an
m4.2xlarge, shared with two other applications.
We have about 21M triples in the database. We hit Fuseki medium-hard,
on the order of 1000 hits per minute. 99%+ of the hits are queries.
Our code could stand to do some client-side caching; we get lots of
repetitive queries. That said, Fuseki is normally plenty fast at
those - it's rare that a query takes >10ms.
It looks like I'm getting hit by JENA-1516; I will schedule an
upgrade to 3.7 ASAP.
The log is full of errors like this.
[2018-06-11 16:15:07] BindingTDB ERROR get1(?s)
org.apache.jena.tdb.base.file.FileException:
ObjectFileStorage.read[nodes](488281706)[filesize=569694455][file.size()=569694455]:
Failed to read the length : got 0 bytes
at
org.apache.jena.tdb.base.objectfile.ObjectFileStorage.read(ObjectFileStorage.java:341)
[2018-06-11 16:15:39] BindingTDB ERROR get1(?identifier)
org.apache.jena.tdb.base.file.FileException: In the middle of an
alloc-write
at
org.apache.jena.tdb.base.objectfile.ObjectFileStorage.read(ObjectFileStorage.java:311)
at
org.apache.jena.tdb.base.objectfile.ObjectFileWrapper.read(ObjectFileWrapper.java:57)
at org.apache.jena.tdb.lib.NodeLib.fetchDecode(NodeLib.java:78)
The problem that got me looking is that Fuseki memory usage goes
nuts, which causes the server to start swapping, etc. Swapping =
slow = pager. Total memory + swap in use by Fuseki when I
investigated was about 32GB; it's configured to use a 16GB heap.
Garbage collection logging was not configured properly, so I can't
say whether my immediate problem was heap exhaustion.
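If the excess turns out to be NIO direct buffers,
-XX:MaxDirectMemorySize would cap them and turn silent growth into an
explicit OutOfMemoryError (memory-mapped files are not counted
against it, though). A sketch of the launch line with such a cap -
the 4G figure and the paths are illustrative:

  java -Xmx16G -XX:MaxDirectMemorySize=4G \
       -jar fuseki-server.jar --config=fcrepo.ttl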
I'm monitoring swap usage hourly - sometime in a <1hr timeframe the
swap usage increased past 2GB (10%) to about 11GB (10GB of which was
cleared after I restarted Fuseki). So the memory ballooned fairly
quickly when it happened.
The TDB errors happen much earlier than the memory going nuts.
Obviously the memory blowup could be a delayed effect of those
errors, but I'm wondering:
- if this rings a bell in some other way - how much memory should I
expect Fuseki to need?
- if there is any particular debugging I should enable
- if our traffic level is out of the ordinary
thanks
danno