Paul,
The default heap size is quite small (1.2G) because it has to work for
32-bit systems as well. Set JVM_ARGS if using the fuseki-server script,
or run the server directly with java -jar and pass your own -Xmx.
(Hmm - the "fuseki" script does
JAVA_OPTIONS+=("-Dlog4j.configuration=log4j.properties" "-Xmx1200M")
so it is not checking if -Xmx is already set.
)
but the bulk loader, with its file-manipulation (and very
non-transactional!) tricks, is significantly faster.
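A sketch of setting a larger heap as described above (the 4G figure, the
database location, and the dataset name are all illustrative):

```shell
# Raise the heap via JVM_ARGS, which the fuseki-server script reads:
JVM_ARGS="-Xmx4G" ./fuseki-server --loc=/path/to/DB /dataset

# Or bypass the script and run the server jar directly,
# supplying the heap setting yourself:
java -Xmx4G -jar fuseki-server.jar --loc=/path/to/DB /dataset
```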
Andy
On 02/11/12 23:45, Rob Vesse wrote:
In the meantime you might want to try using tdbloader/tdbloader2
(http://jena.apache.org/documentation/tdb/commands.html#tdbloader2) to
create the TDB dataset offline instead
You can then start up a Fuseki server and connect to the TDB dataset you
created
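A minimal sketch of that offline approach (the database location and
dataset name are placeholders):

```shell
# Build the TDB dataset offline; tdbloader2 writes the indexes
# directly to disk, avoiding the server's in-memory parsing path.
./tdbloader2 --loc /path/to/DB baseKB/*.nt.gz

# Then serve the prebuilt dataset with Fuseki, allowing updates:
./fuseki-server --loc=/path/to/DB --update /dataset
```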
Rob
On 11/2/12 3:41 PM, "Stephen Allen" <[email protected]> wrote:
Hi Paul,
Thanks for the report. This is a known issue in Fuseki (see JENA-309
[1]). I have plans to work on this soon. Also I'm a little surprised
that your second attempt after breaking it into chunks failed, I'll
take a look at that.
I am also working on a related issue (JENA-330 [2]) that will
eliminate limits on SPARQL Update queries. I hope to have that
checked into the trunk soon.
-Stephen
[1] https://issues.apache.org/jira/browse/JENA-309
[2] https://issues.apache.org/jira/browse/JENA-330
On Fri, Nov 2, 2012 at 5:24 PM, Paul Gearon <[email protected]> wrote:
This is probably pushing Jena beyond its design limits, but I thought
I'd report on it anyway.
I needed to test some things with large data sets, so I tried to load
the data from http://basekb.com/
Once extracted from the tar.gz file, it creates a directory called
baseKB filled with 1024 gzipped nt files.
On my first attempt, I grabbed a fresh copy of Fuseki 0.2.5 and started
it with TDB storage. I didn't want to individually load 1024 files from
the control panel, so I used zcat to dump everything into one file and
tried loading from the GUI. This failed in short order with RIOT
complaining of memory:
13:24:31 WARN Fuseki :: [1] RC = 500 : Java heap space
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:2694)
at java.lang.String.<init>(String.java:234)
at java.lang.StringBuilder.toString(StringBuilder.java:405)
at org.openjena.riot.tokens.TokenizerText.readIRI(TokenizerText.java:476)
...etc...
I'm wondering if RIOT really needed to run out of memory?
Anyway, I went back to the individual files. That meant using a non-GUI
approach. I wasn't sure about the right media type for N-Triples, but
N-Triples is compatible with Turtle, so I used text/turtle.
I threw away the DB directory and started again. This time I tried to
load the files with the following bash:
for i in *.nt.gz; do
  echo "Loading $i"
  zcat $i | curl -X POST -H "Content-Type: text/turtle" --upload-file - "http://localhost:3030/dataset/data?default"
done
This started reasonably well. A number of warnings showed up on the
server side, due to bad language tags and invalid IRIs, but it kept
going. However, on the 20th file I started seeing these:
Loading triples0000.nt.gz
Loading triples0001.nt.gz
Loading triples0002.nt.gz
Loading triples0003.nt.gz
Loading triples0004.nt.gz
Loading triples0005.nt.gz
Loading triples0006.nt.gz
Loading triples0007.nt.gz
Loading triples0008.nt.gz
Loading triples0009.nt.gz
Loading triples0010.nt.gz
Loading triples0011.nt.gz
Loading triples0012.nt.gz
Loading triples0013.nt.gz
Loading triples0014.nt.gz
Loading triples0015.nt.gz
Loading triples0016.nt.gz
Loading triples0017.nt.gz
Loading triples0018.nt.gz
Loading triples0019.nt.gz
Error 500: GC overhead limit exceeded
Fuseki - version 0.2.5 (Build date: 2012-10-20T17:03:29+0100)
Loading triples0020.nt.gz
Error 500: GC overhead limit exceeded
Fuseki - version 0.2.5 (Build date: 2012-10-20T17:03:29+0100)
Loading triples0021.nt.gz
Error 500: GC overhead limit exceeded
Fuseki - version 0.2.5 (Build date: 2012-10-20T17:03:29+0100)
This kept going until triples0042.nt.gz where it hung for hours.
Meanwhile, on the server, I was still seeing parser warnings, but also
messages like:
17:01:26 WARN SPARQL_REST$HttpActionREST :: Transaction still active in endWriter - no commit or abort seen (forced abort)
17:01:26 WARN Fuseki :: [33] RC = 500 : GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
When I finally killed it (with ctrl-C), I got several stack traces in
the stdout log. They appeared to indicate a bad state, so I've saved
them and put them up at: http://pastebin.com/yar5Pq85
While OOM is very hard to deal with, I'm still surprised to see it hit
this way, so I thought you might be interested to see it.
Regards,
Paul Gearon